A write buffer for a superpipelined, superscalar microprocessor.



A superscalar, superpipelined microprocessor having a write buffer located between the central
processing unit core and memory cache. The write buffer stores the results of write operations to
memory until the cache memory becomes available, i.e., when no high-priority reads are to be
performed. The write buffer includes multiple entries that are split into two circular buffer
sections for facilitating the interaction with the two core pipelines. Cross-dependency tables are
provided for each write buffer entry to ensure that the data is written from the write buffer to
memory in program order, while considering any prior data in the opposite section. Non-cacheable
reads from memory are also ordered in program order with the writing of data from the write buffer.
Features for performing misaligned writes, handling speculative execution, detecting and handling
data dependencies and exceptions, and performing gathered writes are also included within the
microprocessor.
Background of the Invention

1. Field of the Invention 

This invention is in the field of integrated circuits of the microprocessor type, and is more
specifically directed to memory access circuitry in the same.

2. Description of Related Art 

In the field of microprocessors, the number of instructions executed per second is a primary
performance measure. As is well known in the art, many factors in the design and manufacture of a
microprocessor affect this measure. For example, the execution rate depends quite strongly on the
clock frequency of the microprocessor. The frequency of the clock applied to a microprocessor is
limited, however, by power dissipation concerns and by the switching characteristics of the
transistors in the microprocessor.

The architecture of the microprocessor is also a significant factor in the execution rate of a
microprocessor. For example, many modern microprocessors utilize a "pipelined" architecture to
improve their execution rate if many of their instructions require multiple clock cycles for
execution. According to conventional pipelining techniques, each microprocessor instruction is
segmented into several stages, and separate circuitry is provided to perform each stage of the
instruction. The execution rate of the microprocessor is thus increased by overlapping the execution
of different stages of multiple instructions in each clock cycle. In this way, one multiple-cycle
instruction may be completed in each clock cycle.

By way of further background, some microprocessor architectures are of the "superscalar" type, where
multiple instructions are issued in each clock cycle for execution in parallel. Assuming no
dependencies among instructions, the increase in instruction throughput is proportional to the
degree of scalability.

Another known technique for improving the execution rate of a microprocessor and the system in which
it is implemented is the use of a cache memory. Conventional cache memories are small high-speed
memories that store program and data from memory locations which are likely to be accessed in
performing later instructions, as determined by a selection algorithm. Since the cache memory can be
accessed in a reduced number of clock cycles (often a single cycle) relative to main system memory,
the effective execution rate of a microprocessor utilizing a cache is much improved over a non-cache
system. Many cache memories are located on the same integrated circuit chip as the microprocessor
itself, providing further performance improvement.

According to each of these [...] techniques, certain events may occur that slow the microprocessor
performance. For example, in both the pipelined and the superscalar architectures, multiple
instructions may require access to the same internal circuitry at the same time, in which case one
of [...].

One type of such a conflict often occurs where one instruction requests a write to memory (including
cache) at the same time that another instruction requests a read from the memory. If the requests
are serviced on a "first-come-first-served" basis, the later-arriving instruction will have to wait
for the completion of a prior-arriving instruction's memory access. These and other stalls are, of
course, detrimental to microprocessor performance.

It has been discovered that, for most instruction sequences (i.e., programs), reads from memory or
cache are generally more time-critical than writes to memory or cache, especially where a large
number of general-purpose registers are provided in the microprocessor architecture. This is because
the instructions and input data are necessary at specific times in the execution of the program in
order for the program to execute in an efficient manner. In contrast, since writes to memory are
merely writing the result of the program execution, the actual time at which the writing occurs is
not as critical, since the execution of later instructions may not [...].

By way of further background, write buffers have been provided in microprocessors; such write
buffers are logically located between on-chip cache memory and the bus to main memory. [...] write
buffers receive data from the cache for a write-through or write-back operation; the contents of the
posted write buffer are [written] to main memory under the control of the bus controller, at times
when [...].
By way of further background, many modern microprocessors can access memory locations using
addresses that are not necessarily a modulo of the operand size. An example of a microprocessor type
in which this is the case are those commonly referred to as being "X86" compatible. In these
microprocessors, therefore, some memory writes may include bytes which are outside of the byte block
containing the lowest byte address, such that multiple write cycles are required to accomplish the
write operation. These writes are often referred to as "misaligned" writes; a significant fraction
of memory writes may overlap byte-block boundaries.
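
The misalignment condition described above can be sketched in a few lines of Python. This is only an illustration: the function names `split_write` and `is_misaligned` are hypothetical, and the eight-byte block size is an assumption (the background text does not fix a block size). The sketch splits a write into one piece per aligned block it touches; a write that needs more than one piece is misaligned in the sense used here.

```python
from typing import List, Tuple


def split_write(addr: int, size: int, block: int = 8) -> List[Tuple[int, int]]:
    """Split a write of `size` bytes at `addr` into (address, length) pieces,
    one per `block`-byte aligned block that the write touches.
    The default block size of 8 bytes is an assumption for illustration."""
    pieces = []
    end = addr + size
    while addr < end:
        # Number of bytes left in the aligned block that contains `addr`.
        room = block - (addr % block)
        length = min(room, end - addr)
        pieces.append((addr, length))
        addr += length
    return pieces


def is_misaligned(addr: int, size: int, block: int = 8) -> bool:
    """In this sketch, a write is misaligned if it needs more than one piece."""
    return len(split_write(addr, size, block)) > 1
```

For example, a four-byte write at address 0x1006 touches two eight-byte blocks and would need two write cycles, while the same write at 0x1004 fits in one.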
By way of further background, microprocessors including both an integer central processing unit and
a floating point processing unit [...]. In such microprocessors, the data word width of the integer
[...] floating point [...]; for example, integer data may be thirty-two bits wide [...].

[...] provided by the programmer, to ensure [...] and counter. However, [...] counter is inadequate.

[...] known for microprocessors of conventional architectures, such as those having [...] to effect
write operations of byte sizes smaller than the capacity of the internal data bus.

By way of further background, pipelined microprocessors are known to be vulnerable to certain hazards
commonly referred to as data dependencies. [...] arise when two instructions at different stages in
the pipeline [...] location, as the pipeline may access the register or memory location for the later
instruction (in program order) before the earlier instruction has written data thereto, which results
in [...]. Techniques for detecting such data dependencies in conventional pipelined microprocessors
are known, as described [...] and Hennessy, [...]. According to conventional [...] the pipeline until
the earlier instruction [...].

By way of further background, [...] speculative execution in order to maintain the pipeline [...] in
the program sequence [...] requires that predictive branching be performed, where the microprocessor
predicts whether the condition [...] speculatively executed instructions [...] the predicted path
[...] the prediction is incorrect, it may be difficult or [...].

Summary of the Invention 

[...] a second write buffer entry is allocated with the first write buffer entry [...] and stored
therein. Upon retiring of the misaligned write, the [...] versa. A latch is [...].

Another aspect of the [...] to be constructed with the [...] of words produced by the central
processing unit. [...] results in words that are wider (in bits) than [...]. A secondary data latch
is provided to store the [...] processing unit, with a control bit being set when the data therein
is valid. A standard [...] physical address corresponding [...] that its data will be stored [...]
contents of the write buffer entry [...].

Another aspect of the invention [...] efficiency. Program order [...] of each write buffer entry with
a map showing which write buffer entries in the opposite section have already been allocated; the
cross-dependency fields in each write buffer entry are cleared, bit by bit, as each write buffer
entry in program order is retired. Retiring of a write buffer entry in program order is ensured by
requiring its cross-dependency field to be clear. Additionally, a similar concept may be used to
ensure the performance of non-cacheable reads in program order with retiring from a write buffer, by
providing a cross-dependency field for the read that is a map of the allocated write buffer entries
at the time the read is allocated, with gating [...].
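
The cross-dependency scheme summarized above can be illustrated with a small behavioral model. This is a hedged sketch rather than the circuit of the preferred embodiment: the class and method names are hypothetical, six entries per section is assumed, and ordering within a section (which the circular buffer pointers would handle) is not modeled. On allocation, an entry records a bitmap of the opposite section's already-allocated entries; each retirement clears the matching bit in the opposite section's maps; an entry may retire only when its map is clear.

```python
class Section:
    def __init__(self, n=6):
        self.valid = [False] * n   # one "allocated" bit per entry
        self.xdep = [0] * n        # per-entry bitmap over the opposite section


class WriteBuffer:
    """Hypothetical two-section model; section names 'x' and 'y' are illustrative."""

    def __init__(self, n=6):
        self.sec = {"x": Section(n), "y": Section(n)}

    @staticmethod
    def _other(name):
        return "y" if name == "x" else "x"

    def allocate(self, name, idx):
        own, opp = self.sec[name], self.sec[self._other(name)]
        # Snapshot which opposite-section entries are already allocated.
        own.xdep[idx] = sum(1 << i for i, v in enumerate(opp.valid) if v)
        own.valid[idx] = True

    def can_retire(self, name, idx):
        own = self.sec[name]
        return own.valid[idx] and own.xdep[idx] == 0

    def retire(self, name, idx):
        if not self.can_retire(name, idx):
            raise RuntimeError("entry still depends on earlier writes")
        own, opp = self.sec[name], self.sec[self._other(name)]
        own.valid[idx] = False
        # Clear this entry's bit in every opposite-section dependency map.
        for i in range(len(opp.xdep)):
            opp.xdep[i] &= ~(1 << idx)
```

In this model, if entry 0 of section x is allocated before entry 0 of section y, the y entry cannot retire until the x entry has retired and cleared its bit.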

Another aspect of the invention includes provisions for performing gathered writes from the write
buffer to the cache. During allocation of the write buffer entries, comparisons are made between the
physical address of the currently allocated entry and those previously allocated to determine if, at
least, the physical addresses allocated are within the same byte group, in which case the multiple
writes may be gatherable, or mergeable, into a single write operation to the cache. Other
constraints on gatherability can include that the bytes are contiguous with one another and that the
writes are from adjacent write instructions in program order. Retiring of gatherable write buffer
entries is effected by loading a latch with the data from the write buffer entries, after shifting
of the data to place it in the proper byte lanes; the write is effected by presentation of the
address in combination [...].

Another aspect of the invention includes provisions for detecting data hazards or dependencies such
as read-after-write (RAW) dependencies, particularly relative to data already written to the write
buffers. [...] is prevented for those entries subject to a RAW dependency, thus avoiding erroneous
reads. Capability may also be provided for sourcing data directly from the write buffer, or even
bypassing the write buffer, to [reduce] the effect of pipeline stalls due to RAW hazards. Further
capability may also be provided to allow the last of multiple [...] to be sourced [...].

Another aspect of the invention includes a speculative execution [...] entry, where writes to the
write buffer during speculative execution are allowed. Each control bit corresponds to a predictive
or speculative branch, and is set upon allocation of a write buffer entry according to the degree of
speculation of the write. In the event of a misprediction, each write buffer entry having its
speculative control bit set for the [...] is flushed, so that the write buffer entry [...].
Exception handling may be accomplished by clearing all write buffer entries that have been allocated
but are not yet [...] pointers and allocation pointers to match when the buffer is empty.
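
The speculative control bits described above can be modeled as a small bit field per entry. The following is a sketch under stated assumptions: the names `Entry`, `allocate`, `resolve`, and `retirable` are hypothetical, and the encoding (one bit per outstanding speculation level, four levels) is an assumption consistent with the four-bit SPEC field listed later in Table A. A misprediction flushes every entry carrying the corresponding bit; a correct prediction clears that bit.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    addr: int
    data: int
    spec: int = 0       # four-bit field: bit k set if the write is speculative on level k
    valid: bool = True


def allocate(entries, addr, data, outstanding_levels):
    """Allocate an entry, marking every speculation level still unresolved."""
    spec = 0
    for level in outstanding_levels:
        spec |= 1 << level
    entries.append(Entry(addr, data, spec))


def resolve(entries, level, mispredicted):
    """Resolve one speculation level: flush on misprediction, else clear the bit."""
    bit = 1 << level
    for e in entries:
        if e.valid and e.spec & bit:
            if mispredicted:
                e.valid = False      # flushed; the entry never reaches cache or memory
            else:
                e.spec &= ~bit


def retirable(e):
    """Assumed rule: only valid, fully non-speculative entries may retire."""
    return e.valid and e.spec == 0
```

Because an entry allocated under nested speculation carries the bits of all enclosing levels, a misprediction at an outer level also flushes writes made under inner levels.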

It is therefore an object of the present invention to provide a microprocessor architecture which
buffers the writing of data from the CPU core into a write buffer, prior to retiring of the data to a
cache, and in which misaligned writes may be easily handled with minimal loss of performance.

It is a further object of the present invention to provide a microprocessor architecture which allows
for storage of execution results in a write buffer prior to retiring data to cache or memory, [...]
a plurality of locations of a smaller bit width than that provided by a secondary processing unit.

It is a further object of the present invention to provide such an architecture where buffering is
provided for the results of the secondary processing unit without requiring all write buffer
locations to be constructed [...].

It is a further object of the present invention to provide a microprocessor architecture which
buffers the writing of data from the CPU core into a write buffer, prior to retiring of the data to a
cache, where the write [...] data from the write buffer to cache or main memory in program order.

It is a further object of the present invention to provide such an architecture which allows for the
performing of non-cacheable reads in program order with the retiring of data from the write buffer.

It is a further object of the present invention to provide a microprocessor architecture which allows
for storage of execution [...] is provided to store the write data from multiple write operations for
presentation from the write buffer to the [...].

It is a further object of the present invention to provide for [...] storage of execution results in
a write buffer prior to retiring data to cache or memory in a manner in which data [...] indication
that an otherwise apparent data dependency is in fact not a data dependency.



It is a further object of the present invention to provide such an architecture which is implemented
in a superpipelined superscalar microprocessor architecture.

It is a further object of the present invention to provide a microprocessor architecture which
buffers the writing of data from the CPU core into a write buffer, prior to retiring of the data to a
cache, [...] from speculative execution or exceptions can be readily performed.

Other objects and advantages of the present invention will be apparent to those of ordinary skill in
the art having reference to the following specification in combination with the drawings.

Brief Description of the Drawings 

Figure 1a illustrates a block diagram of the overall microprocessor.
Figure 1b illustrates a generalized block diagram of the instruction pipeline stages.
Figure 2 illustrates a block diagram of a processor system using the microprocessor.
Figure 3 illustrates a timing diagram showing the flow of instructions through the pipeline stages.
Figure 4 is an electrical diagram, in block form, of the write buffer in the microprocessor of
Figure 1a according to the preferred embodiment of the invention.
Figure 5 is a representation of the contents of one of the entries in the write buffer of Figure 4.
Figure 6 is a flow chart illustrating the [...].
Figure 7 is a [...] of the physical address comparison process in the allocation of Figure 6.
Figure 8 is a map of the address valid bits of the cross-dependency field for a write buffer entry
for one pipeline of the microprocessor of Figure 1a relative to the address valid bits of the write
buffer entries for the [...].
Figure 9 [...].
Figure 10 is a flow chart illustrating the retiring of a write buffer entry according to the
preferred [...].
Figure 11 is a flow chart illustrating a method for detecting and handling dependency hazards
according [...].
[...]
Figure 14 is a flow chart illustrating a method for allocating write buffer locations for misaligned
write [...].
[...]
[...] as used in the microprocessor of Figure 1a according to the preferred embodiment of the
invention.
Figures [...]a and [...]b are flow charts illustrating the allocation and retiring sequences,
respectively, of a non-cacheable read operation according to the preferred embodiment of the
invention.

Detailed Description of the Preferred Embodiment 

The detailed description of an exemplary embodiment of the microprocessor of the present invention is
organized as follows:

1. Exemplary processor system

2. Generalized pipeline architecture

3. Write buffer architecture and operation

4. Read-after-write hazard detection and write buffer operation

5. Speculative execution and exception handling

6. Special write cycles from the write buffer

7. Conclusion

[...] corresponding headings used in this detailed description, are provided for [...]
microprocessor are omitted as to not obscure the description of the invention with unnecessary detail.
1. Exemplary Processor System

The exemplary processor system is shown in Figures 1a, 1b, and Figure 2. Figures 1a and 1b
respectively illustrate the basic functional blocks of the exemplary superscalar, superpipelined
microprocessor along with the pipe stages of the two execution pipelines. Figure 2 illustrates an
exemplary processor system (motherboard) design using the microprocessor.



1.1 Microprocessor 

Referring to Figure 1a, the major sub-blocks of a microprocessor 10 include: (a) central processing
unit (CPU) core 20, (b) prefetch buffer 30, (c) prefetcher 35, (d) branch processing unit (BPU) 40,
(e) address translation unit (ATU) 50, and (f) unified 16 Kbyte code/data cache 60, including TAG
RAM 62. A 256 byte instruction line cache 65 provides a primary instruction cache to reduce
instruction fetches to the unified cache, which operates as a secondary instruction cache. An
onboard floating point unit (FPU) 70 executes floating point instructions issued to it by the CPU
core 20.

The microprocessor uses internal 32-bit address and 64-bit data buses ADS and DATA. A 256 bit
(32 byte) prefetch bus (PFB), corresponding to the 32 byte line size of the unified cache 60 and the
instruction line cache 65, allows a full line of 32 instruction bytes to be transferred to the
instruction line cache in a single clock. Interface to external 32 bit address and 64 bit data buses
is through a bus interface unit (BIU).

The CPU core 20 is a superscalar design with two execution pipes X and Y. It includes an instruction
decoder 21, address calculation units 22X and 22Y, execution units 23X and 23Y, and a register file
24 with 32 32-bit registers. An AC control unit 25 includes a register translation unit 25a with a
register scoreboard and register renaming hardware. A microcontrol unit 26, including a
microsequencer and microROM, provides execution control.

Writes from CPU core 20 are queued into twelve 32 bit write buffers 29; write buffer allocation is
performed by the AC control unit 25. These write buffers provide an interface for writes to the
unified cache 60; noncacheable writes go directly from the write buffers to external memory. The
write buffer logic supports optional read sourcing and write gathering.

A pipe control unit 28 controls instruction flow through the execution pipes, including: keeping the
instructions in order until it is determined that an instruction will not cause an exception;
squashing bubbles in the instruction stream; and flushing the execution pipes behind branches that
are mispredicted and instructions that cause an exception. For each stage, the pipe control unit
keeps track of which execution pipe contains the earliest instruction, provides a "stall" output and
receives a "delay" input.

BPU 40 predicts the direction of branches (taken or not taken), and provides target addresses for
predicted taken branches and unconditional change of flow instructions (jumps, calls, returns). In
addition, it monitors speculative execution in the case of branches and floating point instructions,
i.e., the execution of instructions speculatively issued after branches which may turn out to be
mispredicted, and floating point instructions issued to the FPU 70 which may fault after the
speculatively issued instructions have completed. If a floating point instruction faults, or if a
branch is mispredicted (which will not be known until the EX or WB stage for the branch), then the
execution pipeline must be repaired to the point of the faulting or mispredicted instruction (i.e.,
the execution pipeline is flushed behind that instruction), and instruction fetch restarted.

Pipeline repair is accomplished by creating checkpoints of the processor state at each pipe stage as
a floating point or predicted branch instruction enters that stage. For these checkpointed
instructions, all resources (programmer visible registers, instruction pointer, condition code
register) that can be modified by succeeding speculatively issued instructions are checkpointed. If
a checkpointed floating point instruction faults or a checkpointed branch is mispredicted, the
execution pipeline is flushed behind the checkpointed instruction; for floating point instructions,
this will typically mean flushing the entire execution pipeline, while for a mispredicted branch
there may be a paired instruction in EX and two instructions in WB that would be allowed to complete.

For the exemplary microprocessor 10, the principal constraints on the degree of speculation are:
(a) speculative execution is allowed for only up to four floating point or branch instructions at a
time (i.e., the speculation level is maximum 4), and (b) a write or floating point store will not
complete to the cache or external memory until the associated branch or floating point instruction
has been resolved (i.e., the prediction is correct, or floating point instruction does not fault).

The unified cache 60 is 4-way set associative (with a 4k set size), using a pseudo-LRU replacement
algorithm, with write-through and write-back modes. It is dual ported (through banking) to permit
two memory accesses (data read, instruction fetch, or data write) per clock. The instruction line
cache is a fully associative, lookaside implementation (relative to the unified cache), using an LRU
replacement algorithm.
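
As a worked example of the cache geometry stated above (16 Kbyte, 4-way set associative, 32 byte lines, 32-bit addresses), the following sketch derives the number of sets and a conventional tag/index/offset split. The split itself is an assumption for illustration: the text does not describe how the unified cache is indexed, and "4k set size" is read here as 4 Kbytes per way. All names are hypothetical.

```python
CACHE_BYTES = 16 * 1024   # unified cache size from the text
WAYS = 4                  # 4-way set associative
LINE_BYTES = 32           # 32 byte line size
ADDR_BITS = 32            # internal address bus width

# Derived geometry (conventional set-associative arithmetic).
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = SETS.bit_length() - 1
TAG_BITS = ADDR_BITS - INDEX_BITS - OFFSET_BITS


def split_address(addr):
    """Return (tag, set index, byte offset) for a 32-bit address,
    assuming the low-order bits select the byte and then the set."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions there are 128 sets, so an address divides into a 20-bit tag, a 7-bit set index, and a 5-bit byte offset; each way then holds 128 lines of 32 bytes, or 4 Kbytes.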
The FPU 70 includes a load/store stage with 4-deep load and store queues, a conversion stage (32-bit
to 80-bit extended format), and an execution stage. Loads are controlled by the CPU core 20, and
cacheable stores are directed through the write buffers 29 (i.e., a write buffer is allocated for
each floating point store [...]).
Referring to Figure 1b, the microprocessor has seven-stage X and Y execution pipelines: instruction
fetch (IF), two instruction decode stages (ID1, ID2), two address calculation stages (AC1, AC2),
execution (EX), and write-back (WB). Note that the complex ID and AC pipe stages are superpipelined.

The IF stage provides a continuous code stream into the CPU core 20. The prefetcher 35 fetches
16 bytes of instruction data into the prefetch buffer 30 from either the (primary) instruction line
cache 65 or the (secondary) unified cache 60. BPU 40 is accessed with the prefetch address, and
supplies target addresses to [...], allowing the prefetcher to shift to a new code stream in one
clock.

[...] 21 retrieves 16 bytes of instruction data from the prefetch buffer 30 each clock. In ID1, the
length of two instructions is [...] (one each for the X and Y execution pipes) to obtain the X and Y
instruction pointers [...] back to the prefetch buffer (which then increments for the [...]). [...]
and immediate [...] operands are separated. The ID2 stage completes decoding the X and Y
instructions, generating [...] and decoding addressing modes and register fields.

[...] instruction is determined, and the instruction is [...] AC1X. For the exemplary [...]
pipeline: change of flow instructions, floating point instructions, [...] exclusive instructions.
Exclusive instructions include: any instruction that may fault in the EX pipe stage and certain
[...] Multiply/Divide, Input/Output, Push All/Pop [...] issued alone from the ID stage (i.e., they
are not paired with any other instruction). Except for these [...], any instruction [...].

[...] memory operands. The AC1 stage [...] addresses, which are [...]. [...] dependencies are also
checked and resolved [...] renaming hardware); the 32 physical [...] to avoid the delay [...]
accessing register operands for address [...] until the required data [...] dependencies) before
accessing those registers. During the AC2 stage, the [...] register. [...] tables in memory and
workspace [...] tagged to permit, when address translation is enabled, [...] untranslated address
(available at the end of AC1) and, for each set, [...] ATU 50 (available early in AC2). Checks for
any segmentation [...] performed in AC2.

Instructions are kept in program order until [...] cause an exception. For most instructions, this
determination is made during or before AC2 [...] instructions and certain exclusive instructions may
cause exceptions [...] that may still cause an [...] exceptions in order is ensured.

[...] perform the operations defined by the instruction. Instructions spend [...] Phase 2 (PH2) of
AC2.



1.2 System 



Referring to Figure 2, for the exemplary [...] that includes a single chip memory and bus controller
[...] (which reduces its pin count and cost). [...] HT321) provides standard interfaces to a 32 bit
VL [...] keyboard controller 93 and I/O chip 94 as [...] interfaces [...] through a bidirectional
isolation buffer 98 to the low double word [31:0] of the 64 bit processor data bus.

2. Generalized pipeline architecture

Figure 3 illustrates an example of the performance of [...] overlapping execution of the
instructions, for a two pipeline architecture. [...] each pipeline could also be provided. In the
[...] of microprocessor 10 is synchronous with internal clock signal [...] external system clock
signal 124. In Figure 3, internal clock signal 122 is at twice the frequency of system clock
signal 124.

During first internal clock cycle 126, first stage instruction decode stages ID1 operate on
respective instructions X0 and Y0. During second internal clock cycle 128, instructions X0 and Y0
have proceeded to second stage instruction decode stages ID2, and new instructions X1 and Y1 are in
first stage instruction decode units ID1. During third internal clock cycle 130, instructions X2, Y2
are in first stage decode stages ID1, instructions X1, Y1 are in second stage instruction decode
stages ID2, and instructions X0, Y0 are in first address calculation units AC1. During internal
clock cycle 132, instructions X3, Y3 are in first stage instruction decode stages ID1, instructions
X2, Y2 are in second stage instruction decode stages ID2, instructions X1, Y1 are in the first
address calculation stages AC1, and instructions X0 and Y0 are in second address calculation
stages AC2.

As is evident from this description, successive instructions [...] sequentially through the [...]
hardware.

The instruction flow shown in Figure 3 is the [optimum] case. As shown, no stage requires more than
one clock cycle. In an actual machine, though, one [...] cycles to complete [...] the flow of
instructions [...].
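
The ideal, stall-free flow described above for Figure 3 can be reproduced with a few lines of Python. This is a sketch only: the function name `occupancy` is hypothetical, cycles are numbered from zero, and the model assumes one new X/Y instruction pair enters ID1 every internal clock with no stage ever taking more than one clock. The IF stage is omitted because the description of Figure 3 begins at ID1.

```python
# Pipe stages after instruction fetch, in order.
STAGES = ["ID1", "ID2", "AC1", "AC2", "EX", "WB"]


def occupancy(cycle):
    """Return {stage: n} meaning instruction pair (Xn, Yn) occupies that stage
    during the given 0-based internal clock cycle, assuming no stalls."""
    return {stage: cycle - depth
            for depth, stage in enumerate(STAGES)
            if cycle - depth >= 0}
```

For instance, `occupancy(3)` corresponds to internal clock cycle 132 in the text: pair 3 is in ID1, pair 2 in ID2, pair 1 in AC1, and pair 0 in AC2.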

3. Write buffer architecture and operation 

As shown in Figure 1a, write buffer 29 is logically located [...] connected to core 20 by writeback
buses WB_x, WB_y [...] also connected to ATU 50 to receive physical addresses [...] cache port 160
and is also presented to [...].

[...] write buffer 29 allows for core 20 to rapidly perform a memory write operation (from its
viewpoint) and go on to the next instruction in the pipeline [...] memory read operations and without
requiring wait [...]. Further, the memory write operation performed [...] write cycle time,
regardless of whether the memory location is in unified cache 60 or in [...].

Referring now to Figure 4, the [...] construction and operation of write buffer 29 according to the
preferred embodiment of the invention will [...] be understood that the example of write buffer
[...] superpipelined superscalar architecture of microprocessor [...] when utilized in
microprocessors [...] through 152x5, 152y0 [...] efficiency with the superscalar [...]. [...]
buffer 29 in this example is [...] 152x, 152y associated with the X and Y pipelines. [...] be
organized as a single bank, with each entry accessible by either of the X and Y [...].

Write buffer 29 further includes [...] 150, which is combinatorial or sequential logic specifically
designed to [...] 20 in the manner described herein; [...] to this specification will be readily
able [...].

Referring now to Figure 5, the contents [...] described; it is to be understood, of course, that
[...] constructed according to this preferred [...]. [...] entry 152x contains an address portion, a
data portion, and a control portion. [...] 152 is identified by a four bit tag value (not shown), as
four bits are sufficient to [...] entries 152 in write buffer 29. [...] (or source data therefrom)
[...].

For the thirty-two bit integer [...] for the storage of a physical address [...] physical address
bus PAx), and thirty-two bits for storage of a four-byte data word. [...] this preferred embodiment
of the invention, each entry 152x further includes [...] noted below in Table A. These control bits
are utilized by write [...] and issuing of entries 152. In [...] cache 60, are also able to access
[...] will be described in detail hereinbelow relative to the operation of write buffer 29.
Table A

AV    : address valid; the entry contains a valid address
DV    : data valid; the entry contains valid data
RD    : readable; the entry is the last write in the pipeline to its physical address
MRG   : mergeable; the entry is contiguous and non-overlapping to the preceding write buffer entry
NC    : non-cacheable write
FP    : the entry corresponds to floating point data
MAW   : misaligned write
WBNOP : write buffer no-op
WAR   : write-after-read; the entry is a write occurring later in program order than a simultaneous
        read in the other pipeline
SPEC  : four bit field indicating the order of speculation for the entry
XDEP  : cross-dependency map of write buffer section 152y
SIZE  : size, in number of bytes, of data to be written
NCRA  : non-cacheable read has been previously allocated
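
To make the entry layout concrete, the address, data, and Table A control fields can be gathered into a single record. The following Python dataclass is a hedged software model, not the hardware layout: the class name, field names, and especially the `may_retire` rule are assumptions for illustration (the rule simply combines conditions mentioned in the summary: valid address and data, no outstanding speculation, and a clear cross-dependency map).

```python
from dataclasses import dataclass


@dataclass
class WriteBufferEntry:
    address: int = 0      # thirty-two bit physical address
    data: int = 0         # four-byte data word
    size: int = 0         # SIZE: number of bytes to be written
    av: bool = False      # AV: entry contains a valid address
    dv: bool = False      # DV: entry contains valid data
    rd: bool = False      # RD: last write in the pipeline to its physical address
    mrg: bool = False     # MRG: contiguous, non-overlapping with the preceding entry
    nc: bool = False      # NC: non-cacheable write
    fp: bool = False      # FP: entry corresponds to floating point data
    maw: bool = False     # MAW: misaligned write
    wbnop: bool = False   # WBNOP: write buffer no-op
    war: bool = False     # WAR: write-after-read
    ncra: bool = False    # NCRA: non-cacheable read previously allocated
    spec: int = 0         # SPEC: four bit speculation field
    xdep: int = 0         # XDEP: cross-dependency map of the opposite section

    def may_retire(self) -> bool:
        # Assumed rule for illustration only; the actual retire conditions
        # are described in the text and flow charts, not in Table A.
        return self.av and self.dv and self.spec == 0 and self.xdep == 0
```

A fully valid, non-speculative entry with an empty cross-dependency map is retirable in this model; setting any XDEP bit blocks it until the corresponding opposite-section entry retires.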



Write buffer section 152x receives the results of either execution stage EXX of the X pipeline or
execution stage EXY of the Y pipeline via writeback bus WB_x driven by core 20; similarly, write
buffer section 152y receives the results of either execution stage EXX of the X pipeline or
execution stage EXY of the Y pipeline via writeback bus WB_y.

Write buffer sections 152x, 152y present their contents (both address and data sections) to cache
port 160, for example, via circuitry for properly formatting the data. As shown in Figure 4, write
buffer section 152x presents its data to barrel shifter 164x, which in turn presents its output to
misaligned write latch 162x. As will be described in further detail hereinbelow, misaligned write
latch 162x allows for storage of the data from write buffer section 152x for a second write to cache
port 160, which is performed according to the present invention in the event that a write to memory
overlaps an eight-byte boundary. Misaligned write latch 162x presents its output directly to cache
port 160, and also to write gather latch 165; write gather latch 165, as will be described in
further detail hereinbelow, serves to gather data from multiple write buffer entries 152 for a
single write to cache port 160, in the event that the physical addresses of the multiple writes are
in the same eight-byte group.
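
The role of the shifter and write gather latch can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit of Figure 4: the function names are hypothetical, the byte-lane ordering is assumed little-endian (lowest address in the lowest lane), and a per-byte enable mask is introduced here only to show which lanes were written. Writes are gathered only if they fall within the same eight-byte group.

```python
def same_group(addr_a, addr_b, group=8):
    """True if both physical addresses fall in the same aligned eight-byte group."""
    return addr_a // group == addr_b // group


def gather(writes, group=8):
    """Merge writes, given as (addr, size, data) in program order, into one
    group-wide value. Returns (group_base, byte_enable_mask, merged_value).
    Later writes overwrite earlier ones in any lane they share."""
    base = (writes[0][0] // group) * group
    latch = 0
    enables = 0
    for addr, size, data in writes:
        if not same_group(addr, base, group) or (addr % group) + size > group:
            raise ValueError("write is not gatherable into this group")
        lane = addr - base
        mask = (1 << (8 * size)) - 1
        shift = 8 * lane                      # shift data into its byte lanes
        latch = (latch & ~(mask << shift)) | ((data & mask) << shift)
        enables |= ((1 << size) - 1) << lane  # mark the lanes that were written
    return base, enables, latch
```

Two adjacent four-byte writes at 0x1000 and 0x1004, for example, merge into a single eight-byte value with all eight lanes enabled, which could then be presented to the cache in one write.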

Write buffer section 152y presents its output to one input of multiplexer 163, which receives the
output of floating point data latch 166 at its other input; as will be described hereinbelow,
floating point data latch 166 contains the output from the FPU 70, and provides sixty-four bit
floating point data [...] corresponding to one of write buffer entries 152. Multiplexer 163 is
controlled by write buffer control logic 150 and by the cache control logic for unified cache 60, to
select the appropriate input for presentation at its output, as will be described hereinbelow. The
output of multiplexer 163 is presented to shifter 164y and in turn to misaligned write latch 162y,
in similar manner as is the output of write buffer section 152x described hereinabove. The output of
misaligned write latch 162y is also similarly connected directly to cache port 160 and also to write
[gather latch 165].

While only a single cache port 160 is schematically illustrated in Figure 4 for simplicity of
explanation, as described hereinabove, cache port 160 according to this embodiment of the invention
is a dual cache port, enabling presentation of two write requests simultaneously. In addition, write
buffer 29 also communicates data directly to data bus DATA. As such, according to this embodiment of
the invention, the connections to cache port 160 shown in Figure 4 will be duplicated to provide the
second simultaneous write to cache port 160, and will also be provided directly to data bus DATA to
effect a memory write in the event that cache control requires a write to main memory 86.

Also according to the preferred embodiment of the invention, write buffer 29 is capable of sourcing data
directly from its entries 152 to core 20 by way of source buses SRCx, SRCy, under the control of write buffer
control logic 150 which controls multiplexers 154x, 154y. The output of multiplexer 154x may be applied to
either of the X or Y pipelines, under the control of pipeline control 28, via buses mem_x, mem_y to physical
registers 24; similarly, the output of multiplexer 154y may be applied to either of the X or Y pipelines via buses
mem_x, mem_y. In addition, writeback buses WB_x, WB_y are also connected to multiplexers 154x, 154y via
bypass buses BP_x, BP_y, respectively, so that memory bypassing of write buffer 29 is facilitated as will be
described hereinbelow.

As noted above, microprocessor 10 includes an on-chip FPU 70 for performing floating point operations.
As noted above, the results of calculations performed by the FPU 70 are represented by sixty-four bit data
words. According to this preferred embodiment of the invention, efficiency is obtained by limiting the data
portions of write buffer entries 152 to thirty-two bits, and by providing sixty-four bit floating point data latch 166
for receiving data from the FPU 70. Floating point data latch 166 further includes a floating point data valid
(FPDV) control bit which indicates, when set, that the contents of floating point data latch 166 contain valid
data. The address portion of one of write buffer entries 152 will contain the memory address to which the results
from the FPU 70, stored in floating point data latch 166, are to be written; this write buffer entry 152 will have
its FP control bit set, indicating that its data portion will not contain valid data, but that its corresponding data
will instead be present in floating point data latch 166.

Alternatively, of course, floating point data write buffering could be obtained by providing a sixty-four bit
data portion for each write buffer entry 152. According to this embodiment of the invention, however, pre-cache
write buffering of sixty-four bit floating point data is provided but with significant layout and chip area
efficiency. This efficiency is obtained by not requiring each write buffer entry 152 to have a sixty-four bit data
portion; instead, floating point data latch 166 provides sixty-four bit capability for each entry 152 in write
buffer 29. It is contemplated that, for most applications, the frequency at which floating point data is provided
by the FPU 70 is on the same order as that at which the floating point data will be retired from floating point data
latch 166 (i.e., written to cache or to memory). This allows the single floating point data latch 166 shown in Figure
4 to provide adequate buffering. Of course, in the alternative, multiple floating point data latches 166 could be
provided in microprocessor 10 if additional buffering is desired.

The operation of write buffer 29 according to the preferred embodiment of the invention will now be
described in detail. This operation is under the control of write buffer control logic 150, which is combinatorial or
sequential logic arranged so as to perform the functions described hereinbelow. As noted above, it is
contemplated that one of ordinary skill in the art will be readily able to implement such logic to accomplish the
functionality of write buffer control logic 150 based on the following description.

Specifically, according to this embodiment of the invention, write buffer control logic 150 includes X and
Y allocation pointers 156x, 156y, respectively, and X and Y retire pointers 158x, 158y, respectively; pointers
156, 158 will keep track of the entries 152 in write buffer 29 next to be allocated or retired, respectively.
Accordingly, sections 152x, 152y of write buffer 29 each operate as a circular buffer for purposes of allocation
and retiring, and as a file of addressable registers for purposes of issuing data. Alternatively, write buffer 29
may be implemented as a fully associative primary data cache, if desired.

In general, upon second address calculation stages AC2 determining that a memory write will be performed
during the execution of an instruction, one of write buffer entries 152 will be "allocated" at such time as the
physical address is calculated in this stage, such that the physical address is stored in the address portion of
an entry 152 and its address valid control bit (AV) and other appropriate control bits are set. After execution
of the instruction, and during writeback stages WBX, WBY (Fig. 1a), core 20 writes the result in the data portion
of that write buffer entry 152 to "issue" the write buffer entry, setting the data valid control bit (DV). The write
buffer entry 152 is "retired" in an asynchronous manner, in program order, by interrogating the AV and DV bits
of a selected entry 152 and, if both are set, by causing the contents of the address and data portions of the
entry 152 to appear on the cache port 160 or the system bus, as the case may be.
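
By way of illustration only, the allocate, issue and retire life cycle of a single write buffer entry 152 may be sketched in software as follows. This is a minimal model and not the hardware realization: the class and function names are hypothetical, and the clearing of both control bits AV and DV at retirement is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class WriteBufferEntry:
    """Illustrative model of one write buffer entry 152 (names are hypothetical)."""
    address: int = 0
    data: int = 0
    av: bool = False  # address valid control bit, set at allocation
    dv: bool = False  # data valid control bit, set at issue


def allocate(entry: WriteBufferEntry, physical_address: int) -> None:
    """Store the physical address calculated in stage AC2 and set AV."""
    entry.address = physical_address
    entry.dv = False
    entry.av = True


def issue(entry: WriteBufferEntry, data: int) -> None:
    """Latch the writeback data into the thirty-two bit data portion and set DV."""
    entry.data = data & 0xFFFFFFFF
    entry.dv = True


def retire(entry: WriteBufferEntry):
    """Return (address, data) if the entry is pending (AV and DV set), else None.

    Assumption of this sketch: both AV and DV are cleared on retirement.
    """
    if not (entry.av and entry.dv):
        return None
    result = (entry.address, entry.data)
    entry.av = False
    entry.dv = False
    return result
```

An entry that has been allocated but not yet issued is not retirable in this model; only once both bits are set does `retire` present the address and data.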

3.1 Allocation of write buffer entries 

Referring now to Figure 6, the process for allocation of write buffer entries 152 according to the preferred
embodiment of the invention will now be described in detail. This process is performed as part of the second
address calculation stages AC2 in both the X and Y pipelines. As shown by process 170 of Figure 6, the
allocation process is initiated upon the calculation of a physical memory address to which results of an
instruction are to be written (i.e., a memory write).

For ease of explanation, the sequence of Figure 6 will be described relative to one of the sections 152x,
152y of write buffer 29. The allocation of write buffer entries 152 in the opposite section of write buffer 29 will
proceed in the same manner.

Process 172 retrieves control bit AV from the write buffer entry 152 to which the allocation pointer 156 is
pointing. Each side of write buffer 29 according to this embodiment of the invention operates as a circular
buffer, with allocation pointers 156x, 156y indicating the next write buffer entry 152 to be allocated in the
respective section. For purposes of this description, the write buffer entry 152 to which the appropriate
allocation pointer 156x, 156y points will be referred to as 152n. Decision 173 determines if control bit AV is set
(1) or cleared (0). If control bit AV is already set, write buffer entry 152n is already allocated or pending, as it
has a valid address already stored therein. As such, entry 152n is not available to be allocated at this time,
causing wait state 174 to be entered, followed by repeated retrieval and checking of control bit AV.

If control bit AV is clear, write buffer entry 152n is available for allocation, as it is not already allocated or
pending. In this case, process 176 stores the physical address calculated in process 170 into the address
portion of write buffer entry 152n. It is contemplated that the processes of the allocation sequence described
hereinbelow may be performed in any order deemed advantageous or suitable for the specific realization by
one of ordinary skill in the art.
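
The availability check and wait state described above can be modelled, purely as a sketch, by a circular buffer whose allocation pointer advances only on a successful allocation. The class name, method name and the return of `None` to represent wait state 174 are assumptions made for illustration; the section size of six follows the six cross-dependency bits per entry described in this specification.

```python
class WriteBufferSection:
    """Illustrative model of one section (152x or 152y) as a circular buffer."""

    def __init__(self, size: int = 6):
        self.av = [False] * size       # address valid control bits
        self.address = [0] * size      # address portions
        self.alloc_ptr = 0             # models allocation pointer 156

    def try_allocate(self, physical_address: int):
        """Attempt to allocate entry 152n at the allocation pointer.

        Returns the entry index on success, or None when AV is already set
        (the caller would then wait and retry, as in wait state 174).
        """
        n = self.alloc_ptr
        if self.av[n]:
            return None
        self.address[n] = physical_address
        self.av[n] = True
        # Advance the pointer with wrap-around, as in a circular buffer.
        self.alloc_ptr = (n + 1) % len(self.av)
        return n
```

In this model a full section simply refuses further allocations until the entry at the allocation pointer has been retired and its AV bit cleared.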

3.1.1 Read-after-multiple-write hazard handling 

According to this embodiment of the invention, certain data dependencies are detected and handled relative
to write buffer accesses. As is well known in the art, data dependencies are one type of hazard in a pipelined
architecture microprocessor. One such hazard may arise where the pipeline contains a read of a physical
memory address that is to be performed after multiple writes to the same physical address and prior to the
retiring of the write buffer entries 152 assigned to this address. According to the preferred embodiment of the
invention, only write buffer entries 152 having their control bit RD set can be used to source data to core 20
via buses SRCx, SRCy. This avoids the possibility that incorrect data may be sourced to core 20 from a
completed earlier write, instead of from a later allocated but not yet executed write operation to the same
physical address.

In process 178, write buffer control logic 150 examines the physical addresses of the previously allocated
write buffer entries 152.

Figure 7 illustrates the method by which the physical addresses of different memory writes are compared
in process 178 according to the preferred embodiment of the invention, to detect the presence of any bytes
which will be written by both of the write operations. Such an overlap causes one or more of the bits in map
ANDSPAN to be true. Accordingly, and as will be described hereinbelow, if a later read of write buffer 29 is
to be performed (i.e., sourcing of data from write buffer 29 prior to retiring), only last-written write buffer
entry 152n will have its control bit RD set and thus will be able to present its data to core 20 via source
buses SRCx, SRCy. Those write buffer entries 152 having valid data (control bit DV set) but having their
control bit RD clear are prevented by write buffer control logic 150 from sourcing their data to buses SRCx,
SRCy.
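
As a rough illustration of the overlap test, the following sketch builds a byte mask for each write within its aligned eight-byte group and ANDs the two masks, the result standing in for map ANDSPAN. The details of Figure 7 are not reproduced here; the mask layout, the function names, the dictionary fields and the simplification that neither write crosses an eight-byte boundary are all assumptions of the sketch.

```python
def byte_span(address: int, length: int) -> int:
    """Bit i is set if byte i of the aligned eight-byte group is written.

    Simplifying assumption: the write does not cross an eight-byte boundary.
    """
    offset = address % 8
    if offset + length > 8:
        raise ValueError("sketch handles only writes within one eight-byte group")
    return ((1 << length) - 1) << offset


def and_span(addr_a: int, len_a: int, addr_b: int, len_b: int) -> int:
    """Non-zero when both writes touch at least one common byte."""
    if addr_a // 8 != addr_b // 8:
        return 0
    return byte_span(addr_a, len_a) & byte_span(addr_b, len_b)


def allocate_with_rd(entries: list, address: int, length: int) -> None:
    """Clear RD on earlier overlapping entries; the newest write keeps RD set."""
    for entry in entries:
        if and_span(entry["addr"], entry["len"], address, length):
            entry["rd"] = False
    entries.append({"addr": address, "len": length, "rd": True})
```

With this model, only the last-written entry covering a given byte remains eligible to source data, which mirrors the role of control bit RD described above.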

3.1.2 Cross-dependency and retiring in program order

As noted above, write buffer entries 152 must be retired (i.e., written to unified cache 60 or main memory
86) in program order. For those implementations of the present invention where only a single bank of write
buffer entries 152 is used, program order is readily maintained by way of a single retire pointer 158. However,
because of the superscalar architecture of microprocessor 10, and in order to obtain layout efficiency in the
realization of write buffer 29, as noted above this example of the invention splits write buffer entries 152 into
two groups, one for each of the X and Y pipelines, each having its own retire pointer 158x, 158y, respectively.
This preferred embodiment of the invention provides a technique for ensuring retirement in program order
between X section write buffer entries 152x and Y section write buffer entries 152y.

Referring now to Figure 8, a map of cross-dependency control bits XDEP for a selected write buffer entry
in X section 152x, at the time of its allocation, is illustrated. As shown in Figure 8, each write buffer entry in
the X section 152x of write buffer 29 has six cross-dependency control bits XDEP0 through XDEP5, each bit
corresponding to one of the write buffer entries in the Y section 152y of write buffer 29; similarly (and not
shown in Figure 8), each write buffer entry in the Y section 152y has six cross-dependency control bits, each
corresponding to one of the write buffer entries in the X section 152x of write buffer 29. As illustrated in Figure
8, the contents of each cross-dependency bit XDEP for the X section entry correspond to the state of
control bit AV for a corresponding write buffer entry in the Y section 152y of write buffer 29, at the time
of allocation.

Process 180 in the allocation process of Figure 6 loads cross-dependency control bits XDEP0 through
XDEP5 for write buffer entry 152n that is currently being allocated, with the state of the address valid control
bits AV for the six write buffer entries in the opposite section of write buffer 29 at the time of allocation.
As will be described in further detail hereinbelow, as each write buffer entry 152 is retired, its corresponding
cross-dependency control bit XDEP in each of the write buffer entries 152 in the opposite portion of write buffer
29 is cleared. After the allocation sequence, no additional setting of any of the cross-dependency control bits
XDEP of write buffer entry 152n occurs.

Program order is thus maintained by requiring that, in order to retire a write buffer entry 152, all six of its
cross-dependency control bits XDEP0 through XDEP5 must be cleared (i.e., equal to 0). Accordingly, the setting
of cross-dependency control bits XDEP at allocation takes a "snapshot" of those write buffer entries 152 in
the opposite section that are previously allocated (i.e., ahead of the allocated write buffer entry 152n in
program order). The combination of cross-dependency control bits XDEP with retire pointers 158x, 158y, as
described hereinbelow, ensures that retiring is performed in the proper program order.
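
The snapshot behaviour of cross-dependency control bits XDEP can be illustrated with the following minimal sketch. It assumes six entries per section, as stated above; the class and function names are hypothetical, and retire pointers are omitted so that only the cross-section ordering is modelled.

```python
SECTION_SIZE = 6  # six entries per section, per the six XDEP bits described


class XdepSection:
    """Illustrative model of one section with per-entry XDEP bits."""

    def __init__(self):
        self.av = [False] * SECTION_SIZE
        # xdep[n][j] models bit XDEPj of entry n: entry j of the opposite
        # section was already allocated when entry n was allocated.
        self.xdep = [[False] * SECTION_SIZE for _ in range(SECTION_SIZE)]


def allocate_entry(section: XdepSection, opposite: XdepSection, n: int) -> None:
    """Process 180: snapshot the opposite section's AV bits, then set AV."""
    section.xdep[n] = list(opposite.av)
    section.av[n] = True


def can_retire(section: XdepSection, n: int) -> bool:
    """An entry is retirable across sections only when all its XDEP bits are clear."""
    return section.av[n] and not any(section.xdep[n])


def retire_entry(section: XdepSection, opposite: XdepSection, n: int) -> None:
    """Clear AV and clear the corresponding XDEP bit in every opposite entry."""
    section.av[n] = False
    for row in opposite.xdep:
        row[n] = False
```

Because the bits are loaded only at allocation, a later allocation in the opposite section never adds a dependency to an entry that was allocated earlier.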

3.1.3 Completion of allocation process

Process 182 is then performed in the allocation of write buffer entry 152n, in which certain control bits in
write buffer entry 152n are set. Non-cacheable write control bit (NC) is set if the memory write
operation is to be non-cacheable. Mergeable control bit (MRG) is set for write buffer entry 152n if the physical
memory locations corresponding thereto are contiguous and non-overlapping with the memory locations
corresponding to a previously allocated write buffer entry 152, such that a gathered write operation may be
performed. Write-after-read control bit (WAR) is set if the write operation to write buffer entry 152n is to be
performed after a simultaneous read in the other pipeline. Misaligned write control bit (MAW) is set if the length
of the data to be written to the physical address stored in write buffer entry 152n crosses an eight-byte boundary
(in which case two write cycles will be required to retire write buffer entry 152n). Control bit NCRA is set if a
non-cacheable read has previously been allocated and not yet performed.
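
As an example of how the condition for setting control bit MAW might be evaluated, the following sketch tests whether a write of a given length crosses an eight-byte boundary and, if so, splits it into the two write cycles mentioned above. The function names are hypothetical and the split shown is only one plausible arrangement, assuming writes no longer than eight bytes.

```python
def crosses_eight_byte_boundary(address: int, length: int) -> bool:
    """True when the write spans two aligned eight-byte groups (MAW condition)."""
    return (address % 8) + length > 8


def write_cycles(address: int, length: int) -> list:
    """Return the (address, length) pairs of the one or two write cycles needed.

    Assumes length is at most eight bytes, so at most one boundary is crossed.
    """
    if not crosses_eight_byte_boundary(address, length):
        return [(address, length)]
    first_length = 8 - (address % 8)
    return [(address, first_length), (address + first_length, length - first_length)]
```

For instance, a four-byte write at address 0x1006 touches bytes 0x1006 through 0x1009 and is therefore split at 0x1008 in this model.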

Once the storing of the physical address and the setting of the control bits in write buffer entry 152n is
complete, control bit AV for write buffer entry 152n is set in process 184. In addition, if not previously cleared
by a previous retire operation, control bit DV is cleared at this time. The setting of control bit AV indicates the
allocation of write buffer entry 152n to subsequent operations, including the setting of cross-dependency
control bits XDEP upon the allocation of a write buffer entry 152 in the opposite section of write buffer 29.

In process 186, write buffer control logic 150 returns the tag value of now-allocated write buffer entry 152n
to core 20. Core 20 then uses this four bit tag value in its execution of the instruction, rather than the full
thirty-two bit physical address value calculated in process 170. The use of the shorter tag value facilitates the
execution of the instruction, and thus improves the performance of microprocessor 10.

The allocation sequence is completed in process 188, in which allocation pointer 156x, 156y (depending
upon whether write buffer entry 152n is in the X or Y sections 152x, 152y of write buffer 29) is incremented to
point to the next write buffer entry 152 to be allocated. Control then passes to process 190, which is the
associated EX stage in the pipeline, if the instruction associated with the write is not prohibited from moving
forward in the pipeline for some other reason.

3.2 Issuing of data to write buffer entries



Referring now to Figure 9, the process of issuing data to write buffer entries 152 will be described in detail
relative to a selected write buffer entry 152i. As noted above, the issue of data to write buffer 29 is performed
by core 20 after completion of the EX stage of the instruction, and during one of the WB stages depending upon
whether operation is in the X or the Y pipeline.

The issue sequence begins with process 192, in which core 20 places the data to be written to write buffer
29 on the appropriate one of writeback buses WB_x, WB_y, depending upon which of the X or Y pipelines is
executing the instruction. Core 20 also communicates the tag of the destination write buffer entry 152 to
write buffer control logic 150. Write buffer control logic 150 then enables write buffer entry 152i, which is the
one of write buffer entries 152 associated with the presented tag value, to latch in the data presented on its
associated writeback bus WB_x, WB_y, in process 194. Once the storage or latching of the data in write buffer
entry 152i is complete, control bit DV is set in process 196, ending the issuing sequence.

Once write buffer entry 152i has both its control bit AV and also its control bit DV set, write buffer entry
152i is in its "pending" state, and may be retired. As noted above, the retiring of a write buffer entry 152 is
accomplished on an asynchronous basis, under the control of cache logic used to operate unified cache 60,
such that the writing of the contents of write buffer entries 152 to unified cache 60 or main memory 86 occurs
on an as-available basis, and does not interrupt or delay the performance of cache or main memory read
operations. Considering that memory reads are generally of higher priority than memory writes, due to the
dependence of the program being executed upon the retrieval of program or data from memory, write buffer 29
provides significant performance improvement over conventional techniques.



3.3 Retiring of write buffer entries 



Referring now to Figure 10, the sequence by way of which write buffer entries 152 are retired under the
control of cache control logic contained within or provided in conjunction with unified cache 60 will now be
described in detail. Certain special or complex write operations will be described in specific detail hereinbelow.
As such, the retiring sequence of Figure 10 is a generalized sequence.

3.3.1 Retiring of integer write buffer data

As noted above, the retiring sequence of Figure 10 is performed under the control of cache control logic
contained within or in conjunction with unified cache 60, and is asynchronous relative to the operation of the
X and Y pipelines. As noted above, it is important that write buffer entries 152 be retired in program order.
Accordingly, write buffer 29 operates as a circular buffer with the sequence determined by retire pointers 158x,
158y for the two portions of write buffer 29. Retire pointers 158x, 158y maintain the program order of write
buffer entries 152 in their corresponding sections 152x, 152y of write buffer 29, and cross-dependency control
bits XDEP maintain the order of entries 152 between sections 152x, 152y, as will be noted from the following
description.

For ease of explanation, as in the case of the allocation sequence described hereinabove, the sequence
of Figure 10 will be described relative to one of the sections 152x, 152y of write buffer 29. The retiring sequence
for the opposite section proceeds in the same manner.

The retiring sequence begins with process 200, in which control bit FP, control bit DV and control bit AV
are retrieved from write buffer entry 152r, which is the one of write buffer entries 152 that retire pointer 158 is
indicating as the next entry 152 to be retired. In decision 201, control bit FP and control bit AV are tested to
determine if write buffer entry 152r is associated with floating point data from the FPU 70. If both control bit FP
and control bit AV are set, write buffer entry 152r is associated with floating point data, the retiring of which is
described hereinbelow.

If control bit AV is set and floating point control bit FP is clear, write buffer entry 152r is directed to integer
data. Decision 202 is next performed, in which the cache control logic determines if control bit AV and control
bit DV are both set. If not (either of AV and DV being clear), entry 152r is not ready to be retired and control
passes to process 200 for repetition of the retrieval and decision processes. If both are set, valid integer data
is present in the data portion of write buffer entry 152r and the entry may be retirable.

Decision 204 is then performed to determine if cross-dependency control bits XDEP are all clear for write
buffer entry 152r. As described hereinabove, cross-dependency control bits XDEP are a snapshot of the
control bits AV for the write buffer entries 152 in the opposite section of write buffer 29, taken at the allocation
of write buffer entry 152r and updated upon the retirement of each write buffer entry 152. If all of the
cross-dependency control bits XDEP are clear for write buffer entry 152r (and since retire pointer 158 is pointing
to it), write buffer entry 152r is next in program order to be retired, and control passes to process 208.

If cross-dependency control bits XDEP are not all clear, then additional write buffer entries 152 in the
opposite section of write buffer 29 must be retired before entry 152r may be retired, so that program order may
be maintained. Wait state 206 is entered, followed by repetition of decision 204, until the write buffer entries
152 in the opposite section that were allocated prior to the allocation of write buffer entry 152r are retired first.

As will be described in detail hereinbelow, microprocessor 10 may include provisions for performing
non-cacheable reads from main memory 86, which must be performed in program order. The presence of a
previously allocated non-cacheable read is indicated for each write entry by control bit NCRA being set; upon
execution of the non-cacheable read, control bit NCRA is cleared for all write buffer entries 152. If this feature is
implemented, decision 204 will also test the state of control bit NCRA, and prevent the retiring of write buffer
entry 152r while its control bit NCRA is set.

Process 208 is then performed, in which the data section of write buffer entry 152r is aligned with the
appropriate bit or byte position for presentation to cache port 160 or to the memory bus. This alignment is
necessary considering that the physical memory address corresponds to specific byte locations, so that the data
must be placed in the proper bit positions before it is written. The aligned contents of write buffer entry 152r
are then presented to cache port 160 or to the memory bus (process 210), and the cross-dependency control
bits XDEP corresponding to write buffer entry 152r are cleared in the write buffer entries 152 in the opposite
section of write buffer 29 (in process 212). This allows the next write buffer entry 152 in sequence (i.e., the
write buffer entry 152 next in program order) to be retired. Control bit AV of write buffer entry 152r is then
cleared (process 214), and retire pointer 158 is incremented (process 216).

As noted hereinabove, cache port 160 serves as a dual cache port, and write buffer 29 in microprocessor
10 also communicates directly with data bus DATA; the cache control logic will select the appropriate
destination for the write. The dual cache port allows additional streamlining in the case where two sections of
write buffer 29 are provided, as the cache control logic may retire two write buffer entries 152 (one in each of
the X and Y sections 152x, 152y of write buffer 29) simultaneously via the dual cache port 160. If such
simultaneous presentation of data is provided, the cross-dependency decision 204 must allow for one of the
write buffer entries 152 to have a single set cross-dependency control bit XDEP, so long as the simultaneously
presented write buffer entry 152 corresponds to the set XDEP bit. The retiring process may thus double its
output rate by utilizing the two sections 152x, 152y of write buffer 29.
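
The relaxed cross-dependency test for simultaneous retirement through the dual cache port can be sketched as follows. This is an illustrative reading of the condition just described, not the actual control logic; the function names and the list-of-booleans representation of the XDEP bits are assumptions.

```python
def set_bits(xdep_row: list) -> list:
    """Indices of the XDEP bits that are set in one entry."""
    return [i for i, bit in enumerate(xdep_row) if bit]


def can_retire_pair(xdep_a: list, index_a: int, xdep_b: list, index_b: int) -> bool:
    """Decide whether entries a and b (one per section) may retire together.

    index_a and index_b are each entry's position within its own section.
    One entry must have all XDEP bits clear; the other may have at most a
    single set bit, and only if that bit corresponds to its partner.
    """
    deps_a = set_bits(xdep_a)
    deps_b = set_bits(xdep_b)
    a_leads = not deps_a and deps_b in ([], [index_a])
    b_leads = not deps_b and deps_a in ([], [index_b])
    return a_leads or b_leads
```

A dependency on any entry other than the simultaneously presented partner still blocks the pair, so program order across the two sections is preserved in this model.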


3.3.2 Retiring of floating point write buffer data

If decision 201 determines that both control bit AV and control bit FP are set, write buffer entry 152r to
which retire pointer 158 points is associated with floating point results from the FPU 70. According to this
embodiment of the invention, control bit DV for entry 152r will also be set despite the absence of valid integer
data therein, for purposes of exception handling as will be described hereinbelow.

Decision 203 is then performed, by way of which the cache control logic interrogates control bit FPDV of
floating point data latch 166 to see if the FPU 70 has written data thereto, in which case control bit FPDV will
be set. Control bit FPDV is analogous to control bit DV of write buffer entries 152, as it indicates when set that
the FPU 70 has written valid data thereto. Conversely, if control bit FPDV is clear, the FPU 70 has not yet written
data to floating point data latch 166, in which case decision 203 will return control to process 200 in the retire
sequence of Figure 10.

If control bit FPDV is set, decision 205 is then performed, by way of which cross-dependency control bits
XDEP of write buffer entry 152r are interrogated to see if all bits XDEP are cleared. If not, additional write buffer
entries 152 that were allocated in program order prior to entry 152r, and that reside in the opposite section of
write buffer 29 from entry 152r, must be retired prior to entry 152r being retired. Wait state 207 is then executed,
and decision 205 is repeated. Upon all cross-dependency control bits XDEP of entry 152r becoming clear,
decision 205 passes control to process 208, for alignment and presentation of the contents of floating point data
latch 166 to cache port 160. As noted above, if simultaneous presentation of two write buffer entries 152 is
allowed via dual cache port 160, one of the entries 152 may have a single set XDEP bit so long as it corresponds
to the simultaneously presented entry of the pair.

Cross-dependency control bits XDEP in opposite section entries 152 are then cleared (process 212),
control bit AV and control bit FPDV are cleared (process 214), and retire pointer 158 is incremented (process 216),
as in the case of integer data described hereinabove.


3.4 Ordering of non-cacheable reads 

The cross-dependency scheme used in the allocation of write buffer entries 152 described hereinabove
may also be used for other functions in microprocessor 10. Similarly as for non-cacheable writes described
hereinbelow, microprocessor 10 may have instructions in its program sequence that require non-cacheable
reads from memory. By way of definition, a non-cacheable read is a read from main memory 86 that cannot
by definition be from unified cache 60; the non-cacheable read may, for purposes of this description, be
considered as a single entry read buffer that serves as a holding latch for requesting a read access to main memory
86. In order to ensure proper pipeline operation, non-cacheable reads must be executed in program order.
Accordingly, especially in the case of superpipelined superscalar architecture microprocessor 10 described
herein, a method for maintaining the program order of non-cacheable reads is necessary.

Referring now to Figure 17, non-cacheable read cross-dependency field 310 according to the preferred
embodiment of the invention is illustrated. Non-cacheable read cross-dependency field 310 is preferably
maintained in cache control logic of unified cache 60, and includes allocated control bit NCRV which indicates,
when set, that a non-cacheable read has been allocated. Similarly as cross-dependency control bits XDEP described
hereinabove, and as described above, control bit NCRA in each write buffer entry 152 is set, at the time of its
allocation, if allocated control bit NCRV is set, indicating that a non-cacheable read is previously allocated.
Control bit NCRA is tested during the retiring of each write entry 152 to ensure proper ordering of requests to
main memory 86.

In addition, non-cacheable read cross-dependency field 310 contains one bit position mapped to each of
the control bits AV of each write buffer entry 152, to indicate which of write buffer entries 152 are previously
allocated at the time of allocation of the non-cacheable read, and to indicate the retirement of these previously
allocated write buffer entries 152. Non-cacheable read cross-dependency field 310 operates in the same
manner as cross-dependency control bits XDEP, with bits set only upon allocation of the non-cacheable read, and
cleared upon retirement of each write buffer entry.

Referring now to Figures 18a and 18b, the processes of allocating and retiring a non-cacheable read
operation according to the preferred embodiment of the invention will now be described in detail. In Figure 18a,
the allocation of a non-cacheable read is illustrated by process 312 first determining that an instruction includes
a non-cacheable read. Process 314 is then performed, by way of which a snapshot of the control bits AV is
loaded into non-cacheable read cross-dependency field 310. Process 316 is then performed, in which allocated
control bit NCRV in non-cacheable read cross-dependency field 310 is set, indicating to later-allocated write
buffer entries 152 that a non-cacheable read operation has already been allocated. Address calculation stage
AC2 then continues (process 318).

Figure 18b illustrates the performing of the non-cacheable read, under the control of the control logic of
unified cache 60. Decision 319 determines if non-cacheable read cross-dependency field 310 is fully clear. If
any bit in non-cacheable read cross-dependency field 310 is set, one or more of the write buffer entries 152
allocated previously to the non-cacheable read has not yet been retired; wait state 321 is then entered and
decision 319 repeated until all previously allocated write buffer entries have been retired.

Upon non-cacheable read cross-dependency field 310 being fully clear, the non-cacheable read is next
in program order to be performed. Process 320 is then executed to effect the read from main memory 86 in
the conventional manner. Upon completion of the read, allocated control bit NCRV in non-cacheable read
cross-dependency field 310 is cleared in process 322, so that subsequent allocations of write buffer entries
152 will not have their control bits NCRA set. Process 324 then clears control bits NCRA in each of write buffer
entries 152, indicating the completion of the non-cacheable read and allowing retiring of subsequent write
buffer entries 152 in program order.

Considering that control bits NCRA in write buffer entries 152, taken as a set, correspond to non-cacheable
read cross-dependency field 310, it is contemplated that the use of a single set of these indicators can suffice
to control the program order execution of the non-cacheable read. For example, if only non-cacheable read
cross-dependency field 310 is used, allocation and retiring of write buffer entries 152 would be controlled by
testing field 310 to determine if a non-cacheable read has been allocated, and by testing the corresponding
bit position in field 310 to determine if the particular write buffer entry 152 was allocated prior to or after the
non-cacheable read.

Therefore, according to this preferred embodiment of the invention, non-cacheable read operations can 
be controlled to be performed in program order relative to the retiring of write buffer entries 152. 
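
The ordering of a non-cacheable read relative to write buffer entries may be pictured with the following sketch of field 310 and control bits NCRA. It is a simplified software model under stated assumptions: the class and method names are hypothetical, the number of entries is a parameter, and the NCRA bits are passed in as a plain list rather than held in the write buffer entries themselves.

```python
class NonCacheableReadField:
    """Illustrative model of non-cacheable read cross-dependency field 310."""

    def __init__(self, n_entries: int = 12):
        self.ncrv = False                    # allocated control bit NCRV
        self.pending = [False] * n_entries   # one bit per write buffer entry

    def allocate(self, av_bits: list) -> None:
        """Processes 314 and 316: snapshot the AV bits, then set NCRV."""
        self.pending = list(av_bits)
        self.ncrv = True

    def on_entry_retired(self, index: int) -> None:
        """Clear the bit for a write buffer entry as it is retired."""
        self.pending[index] = False

    def ready(self) -> bool:
        """Decision 319: the read may proceed once the field is fully clear."""
        return self.ncrv and not any(self.pending)

    def complete(self, ncra_bits: list) -> None:
        """Processes 322 and 324: clear NCRV and every entry's NCRA bit."""
        self.ncrv = False
        for i in range(len(ncra_bits)):
            ncra_bits[i] = False
```

Writes allocated before the read hold it back through the snapshot, while writes allocated after the read are held back through their NCRA bits until `complete` clears them.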

4. Read-after-write hazard detection and write buffer operation 

As discussed above, certain hazards are inherent in pipelined architecture microprocessors, and
particularly in superpipelined superscalar microprocessors such as microprocessor 10. An important category of such
hazards is data dependencies, which may occur if multiple operations to the same register or memory location
are present in the pipeline at a given time.

A first type of data dependency is the RAW (read-after-write) data dependency, in which a write and a read
to the same memory location are present in the pipeline, with the read operation being a newer instruction
than the write. In such a case, the programmer has assumed that the write will be completed before the read
is executed. Due to pipeline operation, however, the memory access for the read operation may be performed
prior to the execution of the write, particularly if the read operation is implicit in another instruction such as
an add or multiply. In this event, the read will return incorrect data to the core, since the write to the memory
location has not yet been performed. This hazard is even more likely to occur in a superscalar superpipelined
architecture such as that of microprocessor 10, and still more likely if instructions can be executed out of
program order, as described above.

Referring to Figure 11, the sequence of detecting and handling RAW hazards in microprocessor 10
according to the preferred embodiment of the invention will now be described in detail. In this example, RAW
hazard detection occurs as a result of physical address calculation process 218 performed in the second
address calculation stage AC2 of the X and Y pipelines for each read instruction. In decision 219, write buffer
control logic 150 compares the read physical address calculated in process 218 against each of the physical
address values in all write buffer entries 152, regardless of pipeline association. This comparison not only
compares the physical address of the read access to those of the previously allocated addresses, but also considers
the span of the operations, in the manner described hereinabove relative to process 178 in Figures 6 and 7.
This comparison is also performed relative to the instruction currently in the second address calculation stage
of the opposite X or Y pipeline. If there is no overlap of the read operation with any of the writes that are either
previously allocated, or simultaneously allocated but earlier in program order, no RAW hazard can exist for that
particular read operation, and execution continues in process 222. If decision 219 determines that there is a
match between the physical address calculated for the read operation and the physical address for one or more
write buffer entries 152 that are allocated for an older instruction and have their address valid control bit AV set,
or that are allocated for a simultaneously allocated write for an older instruction, a RAW hazard may exist and
the hazard handling sequence illustrated in Figure 11 continues.
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As noted above, one of the control bits for each write buffer entry 152 is write-after-read control bit WAR. 
This control bit indicates that the write operation for which a write buffer entry 1 52 is allocated is a write-after- 
read, in that it is a write operation that is to occur after an older (in program order) read instruction that is in 
the second address calculation stage AC2 of the opposite pipeline at the time of allocation. Control bit WAR 
is set in the allocation sequence (process 182 of Figure 6) if this is the case. This prevents lockup of micro- 
processor 10 if the newer write operation executes priorto the older read operation, as the older read operation 
would, upon execution, consider itself a read-after-write operation that would wait until the write is cleared; 
since the write operation is newer than the read and will wait for the read to clear, though, neither the read nor 
the write would ever be performed. Through use of control bit WAR, microprocessor 10 can determine if an 
apparent RAW hazard is in fact a WAR condition, in which case the write can be processed. 
Accordingly, referring back to Figure 11, decision 221 determines if control bit WAR is set for each write
buffer entry 152w having a matching physical address with that of the read, as determined in decision 219.
For each entry 152w in which the WAR bit is set, no RAW conflict exists; accordingly, if none of the matching
entries 152w have a clear WAR bit, execution of the read continues in process 222. However, for each matching
write buffer entry 152w in which control bit WAR is not set, a RAW hazard does exist and the hazard
handling sequence of Figure 11 will be performed for that entry 152w. Of course, other appropriate conditions may
also be checked in decision 221, such as the clear status of the write buffer no-op control bit (WBNOP), and
the status of other control bits and functions as may be implemented in the particular realization of the present
invention.
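
The detection logic of decisions 219 and 221 can be illustrated with a short software model. This is a minimal sketch only, not the circuit of the preferred embodiment: the class `WriteBufferEntry`, its field names, and the helper functions are hypothetical names introduced for illustration, and the WAR bit is simplified to a per-entry flag that is assumed to refer to the read being checked.

```python
from dataclasses import dataclass


@dataclass
class WriteBufferEntry:
    """Hypothetical model of the fields of one entry 152 used in decisions 219 and 221."""
    address: int        # physical address of the first byte to be written
    size: int           # span of the write, in bytes
    av: bool = True     # address valid: entry is allocated and not yet retired
    war: bool = False   # write-after-read: the write is newer than the read being checked


def spans_overlap(addr_a, size_a, addr_b, size_b):
    """True if byte ranges [addr, addr + size) share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


def raw_hazard_entries(read_addr, read_size, entries):
    """Return the entries that pose a genuine RAW hazard for the read.

    An entry matches (decision 219) if its address is valid and its span
    overlaps the span of the read; a match is discarded (decision 221)
    if its WAR bit is set, since the write is then newer than the read.
    """
    return [
        e for e in entries
        if e.av
        and spans_overlap(read_addr, read_size, e.address, e.size)
        and not e.war
    ]
```

An empty result corresponds to the read continuing in process 222; a non-empty result corresponds to entering the remainder of the hazard handling sequence.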

Decision 223 is next performed, in which the control bit AV, address valid, is tested for each RAW entry
152w. Decision 223 is primarily performed to determine if those RAW entries 152w causing wait states for the
read operation (described below) have been retired. If no remaining RAW entries 152w have their control bits
AV set, the RAW hazard has been cleared and the read operation can continue (process 222).

For each of the remaining matching RAW entries 152w, process 224 is next performed to determine if the
entry is bypassable, or if the write causing the hazard must be completed prior to continuing the read operation.
According to the preferred embodiment of the invention, techniques are available by way of which unified cache
60 and, in some cases, write buffer 29 need not be written with the data from the write prior to sourcing of the
data to the read operation in core 20.

Such bypassing is not available for all writes, however. In this example, the results of non-cacheable writes
(indicated by non-cacheable control bit NC being set in entry 152) must be sourced from main memory 86.
Secondly, as discussed hereinabove, a special case of RAW hazard is a read after multiple writes to the same
physical location. As shown in Figure 8, process 178 of the allocation sequence sets control bit RD of a write
buffer entry 152 and clears control bit RD of all previously allocated write buffer entries to the same physical
address. Conversely, those write buffer entries 152 that are not readable (i.e., their control bit RD is clear)
cannot be used to source data to core 20, as their data would be in error. Thirdly, data cannot be sourced from a
write operation if the subsequent read encompasses bytes not written in the write operation, as an access to
cache 60 or main memory 86 would still be required to complete the read.

In the RAW handling sequence of Figure 11, process 224 is performed on each matching write buffer entry
152w to determine if the control bit RD for entry 152w is set (indicating that entry 152w is the last entry 152
allocated to the physical address of the read), to determine if the control bit NC is clear (indicating that the
write is not non-cacheable), and also to determine if the physical address of the read is an "exact" match to
that of the write to write buffer entry 152w, in that the bytes to be read are a subset of the bytes to be written
to memory. An entry 152w for which all three conditions are met is said to be "bypassable", and control passes
to decision 225 described below. If no bypassable entry 152w exists, as one or more of the above conditions
(non-cacheable, non-readable, or non-exact physical address) are not met, wait state 229 is effected and
control passes back to decision 223; this condition will remain until all non-bypassable entries 152w are retired, as
indicated by their control bits AV being clear, after which the read operation may continue (process 222).

In this embodiment of the invention, the method of bypassing applicable to each bypassable entry 152w
is determined in decision 225, in which control bit DV, data valid, is tested to determine if write buffer entry
152w is pending (i.e., contains valid data) but not yet retired. For each bypassable entry 152w that is pending,
process 230 is performed by write buffer control logic 150 to enable the sourcing of the contents of the data
portion of write buffer entry 152w directly to core 20 without first having been written to memory. Referring to
Figure 4, process 230 is effected by write buffer control logic 150 enabling write buffer entry 152w, at the time
of the read operation, to place its data on its source bus SRC (i.e., the one of buses SRCx, SRCy for the section
of write buffer 29 containing entry 152w) and by controlling the appropriate multiplexer 154 to apply source
bus SRC to the one of the X or Y pipelines of core 20 that is requesting the data. In this case, therefore, the
detection of a RAW hazard is handled by sourcing data from write buffer 29 to core 20, speeding up the time
of execution of the read operation.

For those bypassable write buffer entries 152w that are not yet pending, however, as indicated by decision
225 finding that control bit DV is not set, valid data is not present in entry 152w, and cannot be sourced to core
20 therefrom. Process 232 is performed for these entries 152w so that, at the time that the write by core 20 to
write buffer entry 152w occurs, the valid data on writeback bus WB_x or WB_y (also present on the
bypass bus BP_x, BP_y and applied to the appropriate one of multiplexers 154x, 154y) will be applied to
the requesting X or Y pipeline in core 20. In this way, the RAW hazard is handled by bypassing write buffer 29
with the valid data, further speeding the execution of the read operation, as the storing and retrieval of valid
data from cache 60, main memory 86, or even the write buffer entry 152w are not required prior to sourcing of
the data to core 20.
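
The remainder of the hazard handling sequence (decision 223, process 224, decision 225, and processes 230 and 232) can be summarized as a small decision function. The sketch below is illustrative only: `Entry`, `Action`, and the function names are hypothetical, and selecting the first bypassable entry when several entries match is a simplifying assumption rather than a statement of how the preferred embodiment arbitrates.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE_READ = "process 222"
    WAIT = "wait state 229"
    SOURCE_FROM_ENTRY = "process 230"
    SOURCE_FROM_WRITEBACK = "process 232"


@dataclass
class Entry:
    """Hypothetical model of a matching write buffer entry."""
    address: int
    size: int
    av: bool = True    # address valid (entry not yet retired)
    dv: bool = False   # data valid (entry is pending)
    rd: bool = True    # readable: last entry allocated to this address
    nc: bool = False   # non-cacheable write


def is_bypassable(read_addr, read_size, e):
    """Process 224: RD set, NC clear, and the read bytes are a subset of the write bytes."""
    exact = e.address <= read_addr and read_addr + read_size <= e.address + e.size
    return e.rd and not e.nc and exact


def handle_raw(read_addr, read_size, raw_entries):
    """Return the action taken for a read against the entries posing a RAW hazard."""
    outstanding = [e for e in raw_entries if e.av]          # decision 223
    if not outstanding:
        return Action.CONTINUE_READ
    bypassable = [e for e in outstanding if is_bypassable(read_addr, read_size, e)]
    if not bypassable:
        return Action.WAIT                                  # re-evaluated until entries retire
    entry = bypassable[0]                                   # assumption: first bypassable entry
    return Action.SOURCE_FROM_ENTRY if entry.dv else Action.SOURCE_FROM_WRITEBACK  # decision 225
```

In this model, a caller receiving `Action.WAIT` would simply call `handle_raw` again after entries have retired, mirroring the return of control to decision 223.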

5. Speculative execution and exception handling 
5.1 Speculative execution 

As noted above, superpipelined superscalar microprocessor 10 according to the preferred embodiment of
the invention is capable of executing instructions in a speculative manner. The speculation arises from the
execution of one or more instructions after a conditional branch or jump statement, prior to determining the
state of the condition upon which the jump or branch is based. Without speculative execution, the
microprocessor would have to wait for the execution of the instruction that determines the state of the condition, prior
to execution of any subsequent instructions, resulting in a pipeline "stall" condition. In speculative execution,
microprocessor 10 speculates as to the state of the condition, and executes instructions based on this speculation.
The effect of pipeline stalls is reduced significantly, depending upon the number of speculative executions
undertaken and the rate at which the speculation is accurate.

Microprocessor 10 according to this embodiment of the invention includes circuitry for rapidly clearing the
effect of unsuccessful speculation, particularly in ensuring that the results of speculative writes are not retired
to memory and in removing the speculatively written data from write buffer 29. Referring now to Figures 12a
and 12b, a method for executing speculative writes and handling unsuccessful speculation will now be
described in detail. The flow diagrams of Figures 12a and 12b illustrate this method by way of example rather than
in a generalized manner; it is contemplated that one of ordinary skill in the art having reference to the following
description will be able to readily implement the method.

The exemplary sequence of Figure 12a begins with process 240, in which core 20 selects a series of
instructions to be performed in a speculative manner, in that the series of instructions correspond to one result
of a conditional branch where the condition is not yet known. The determination of which of the conditional
branches (i.e., whether or not to take the conditional branch or jump) to select may be made according to
conventional predictive branching schemes. In process 242, allocation of two write buffer entries 152a, 152b (the
[...] speculative [...]) is performed as described hereinabove. However, because the write operations
[...]. Write buffer entries 152c, 152d are not yet allocated, and as such their speculation control bits are clear.
In the example of Figure 12a, second order speculation also occurs, such that one of the instructions in
the branch selected in process 240 included another conditional branch or jump, for which predictive branch
selection is again performed in process 246 to keep the pipeline from stalling. Second order speculation means
that in order for the execution of the instructions for the branch selected in process 246 to be successful, not
only must the selection in process 246 be correct but the selection in process 240 must also be correct. While
process 246 is shown in Figure 12a as occurring after the execution of the instructions in process 244, due to
the superpipelined architecture of microprocessor 10 described hereinabove, the predictive branching of
process 246 will often occur prior to completion of the execution initiated in process 244. Following selection of
the branch in process 246, write buffer entry 152c is allocated in process 248 (again during the second address
calculation pipeline stage). In this allocation of process 248, since any write to write buffer entry 152c is of
second order speculation, both the j and k SPEC control bits are set. The state of control bits SPEC for write
buffer entries 152a, 152b, 152c, 152d after process 248 is shown in Figure 12a. Execution of the speculative
instructions in the branch selected in process 246 is then initiated in process 250.

In the example of Figure 12a, third order speculation is also undertaken, meaning that the sequence of
instructions in the branch selected in process 246 also includes another conditional branch or jump. Process
252 selects one of the branches according to predictive branch selection; however, in order for this third order
selection to be successful, all three of the selections of processes 240, 246 and 252 must be successful. Again,
as before, process 252 may make the selection of the branch prior to completion of the execution of the
instructions in process 250, considering the superpipelined architecture of microprocessor 10. In this example,
write buffer entry 152d is allocated in process 254, with the three j, k and l SPEC bits set in write buffer entry
152d. The state of the control bits SPEC for write buffer entries 152a through 152d after process 254 is
illustrated in Figure 12a. Process 256 then executes the instructions of the branch selected in process 252,
including a write operation to write buffer entry 152d.

Referring now to Figure 12b, an example of the handling of both successful and unsuccessful speculative
execution by write buffer 29 will now be described. As in the example of Figure 12a, the sequence of Figure
12b is by way of example only rather than for the general case, but it is contemplated that one of ordinary skill
in the art will be able to readily realize the method in a microprocessor architecture.

In process 260, core 20 detects that the first selection of process 240 was successful, such that the
condition necessary to cause the branch (or non-branch) to the instructions executed in process 244 was satisfied
in a prior instruction. Accordingly, the contents of the data portions of write buffer entries 152a, 152b allocated
in process 242 and written in process 244 may be retired to memory, as their contents are accurate results of
the program being executed. In process 262, therefore, the j SPEC bits of all speculative write buffer entries
152a, 152b, 152c, 152d are cleared; the state of control bits SPEC for write buffer entries 152a through 152d
after process 262 is illustrated in Figure 12b. Since write buffer entries 152a, 152b now have all of their
speculation control bits SPEC clear (and since their data valid control bits DV were previously set), write buffer entries
152a, 152b may be retired to unified cache 60 or main memory 86, as the case may be.

In the example of Figure 12b, the second branch selection (made in process 246) is detected to be
unsuccessful, as the condition necessary for the instructions executed in process 248 was not satisfied by the prior
instruction. Furthermore, since the selection of the branch made in process 252 also depended upon the
successful selection of process 246, the condition necessary for the instructions to be executed in process 256
also will not be satisfied. To the extent that the writes to write buffer entries 152c, 152d have not yet been
performed, these writes will never be performed, because of the unsuccessful predictive selection noted above;
to the extent that these writes occurred (i.e., write buffer entries 152c, 152d are pending), the data should not
be written to memory as it is in error. Accordingly, write buffer entries 152c, 152d must be cleared for additional
use, without retiring of their contents.

The sequence of Figure 12b handles the unsuccessful speculative execution beginning with process 266,
in which those write buffer entries 152 having their k SPEC bit set are identified by write buffer control logic
150. In this example, these identified write buffer entries 152 are entries 152c (second order speculation) and
152d (third order speculation). In process 268, write buffer control logic 150 clears the address valid control
bits AV for each of entries 152c, 152d, such that entries 152c, 152d may be reallocated and will not be retired
(see the retire sequence of Figure 10, in which the AV bit must be set for retiring to take place).

As described hereinabove, retire pointers 158x, 158y point to the ones of write buffer entries 152 next to
be retired. According to the preferred embodiment of the invention, control bits WBNOP are set for write buffer
entries 152c, 152d, such that when the associated retire pointer 158 points to entries 152c, 152d, these entries
will be skipped (as though they were never allocated). This allows for retire pointers 158 to "catch up" to
allocation pointers 156 if their section of write buffer 29 is empty. Repeated checking of the address valid control
bits AV in the retire process can then safely stop, once the empty condition has been met.

Execution of the proper conditional branch can resume in process 270 shown in Figure 12b.
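
The bookkeeping of the SPEC control bits in Figures 12a and 12b can be modeled in a few lines. The following is a sketch under stated assumptions: the `Entry` class, the representation of the SPEC bits as a set of the letters j, k and l, and the function names are hypothetical, and the retirability test combines only the conditions named in the text above (AV set, DV set, all SPEC bits clear, WBNOP clear).

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical model of the speculation-related state of a write buffer entry."""
    name: str
    spec: set = field(default_factory=set)  # subset of {"j", "k", "l"}
    av: bool = True                         # address valid
    dv: bool = False                        # data valid
    wbnop: bool = False                     # skipped by the retire pointer when set


def resolve_success(entries, bit):
    """Process 262: a successful selection clears its SPEC bit in every entry."""
    for e in entries:
        e.spec.discard(bit)


def resolve_failure(entries, bit):
    """Processes 266 and 268: flush every entry whose SPEC bit for the failed selection is set."""
    flushed = []
    for e in entries:
        if bit in e.spec:
            e.av = False     # entry may be reallocated and will not be retired
            e.wbnop = True   # retire pointer skips the entry
            flushed.append(e.name)
    return flushed


def retirable(e):
    """An entry may retire only when valid, written, and no longer speculative."""
    return e.av and e.dv and not e.spec and not e.wbnop
```

With entries 152a and 152b carrying bit j, entry 152c carrying j and k, and entry 152d carrying j, k and l, calling `resolve_success(entries, "j")` followed by `resolve_failure(entries, "k")` reproduces the outcome of the example: the first two entries become retirable and the last two are flushed.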

5.2 Exception handling

In addition to speculative execution, pipeline stalls and bubbles may occur in the event that execution of
an instruction returns an error condition, commonly referred to as an exception. An example of an exception
is where core 20 detects a divide-by-zero condition. When such an exception is detected in the execution stage
of the pipeline, the instructions still in the pipeline must be cleared in order for the exception condition to be
properly handled in the conventional manner. Relative to write buffer 29, those write buffer entries
152 which were allocated after the instruction causing the exception must be flushed. Since the writes to
these entries 152 will never occur (and control bit DV will never be set) because of the removal
of the write instructions from the pipeline, these entries 152 would never retire from write buffer 29 if not
otherwise flushed; microprocessor 10 would then wait indefinitely for data that would never arrive.

Referring now to Figure 13, an example of a sequence for handling exceptions relative to write buffer 29
will now be described in detail. In process 272, core 20 detects an exception condition. Process 274 is then
performed by write buffer control logic 150, in which control bits AV and DV are retrieved for
each write buffer entry 152 in write buffer 29. Decision 273 then determines if any of the control bits AV are
set in write buffer 29. For each write buffer entry 152 that has its control bit AV set, decision 275 tests its control bit
DV, data valid, to determine if it is set. If not (meaning that the write to that entry 152 had not yet occurred at
the time of the exception), control bit AV is cleared and control bit WBNOP is set for that entry 152.

For those entries 152 having both control bits AV and DV set (as determined by decisions 273, 275), data was
written by core 20 prior to the exception; the data written to these locations is valid, and is retired by way of
the asynchronous retiring sequence as described hereinabove relative to Figure 10. However, prior to the
processing of the exception by microprocessor 10, all entries of write buffer 29 must be retired and available
for allocation (i.e., write buffer 29 must be empty). Control of the sequence thus returns to process 274, in
which control bits AV and DV are again retrieved and interrogated, until such time as the control bits AV for all
write buffer entries 152 are clear. Both allocation pointers 156x, 156y will then point to the same entries 152
as their respective retire pointers 158x, 158y, and the exception is processed in the usual manner.
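
The exception sequence of Figure 13 amounts to a loop that flushes entries never written and waits for written entries to retire. The sketch below models that loop in software; it is illustrative only. The dictionary keys, the function names, and the `retire_one` callback standing in for the asynchronous retire sequence of Figure 10 are all hypothetical, and the setting of WBNOP on flushed entries is an assumption carried over from the handling of failed speculation.

```python
def flush_unwritten(entries):
    """Decisions 273 and 275: flush allocated entries whose data never arrived."""
    for e in entries:
        if e["av"] and not e["dv"]:
            e["av"] = False
            e["wbnop"] = True  # assumption: skipped by the retire pointer, as for speculation


def retire_oldest(entries):
    """Hypothetical stand-in for the asynchronous retire sequence: retire one written entry."""
    for e in entries:
        if e["av"] and e["dv"]:
            e["av"] = False
            e["dv"] = False
            return


def handle_exception(entries, retire_one=retire_oldest):
    """Repeat process 274 until every AV bit is clear; return the number of passes taken."""
    passes = 0
    while True:
        passes += 1
        flush_unwritten(entries)
        if not any(e["av"] for e in entries):
            return passes      # write buffer empty: the exception can be processed
        retire_one(entries)    # written entries still hold valid data and must retire first
```

The pass count is returned only to make the looping behavior observable; it has no counterpart in the described hardware.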

6. Special write cycles from the write buffer

[...] buffer 29 to cache port 160 or directly to [...], including misaligned writes and also write gathering.
Sequences for performing these special write cycles will now be described in detail.
6.1 Misaligned writes 



[...] the physical address in microprocessors [...]; these writes are referred to as "misaligned" writes. [...] a
significant fraction of memory writes may [...]. [...] according to the preferred embodiment of the invention will now
be described in detail relative to Figures 14 and 15.

Figure 14 is a flow diagram [...] misaligned writes and indicating [...] 152 being allocated. In process 280 of
Figure 14, write buffer control logic 150 adds the physical address ([...] byte address) of the write operation to
write buffer entry 152n being allocated [...] information regarding the [...] for X86 type microprocessor
instructions [...]. [...] process 284 is performed to set control bit MAW in entry [...] is then allocated for purposes
of the misaligned write, [...] address portion of entry 152n+1 with the [...] address (i.e., the eight-byte address
[...]) [...] no issuing of data to entry 152n+1 will occur.

Referring now to Figure 15, a sequence for presenting a misaligned write in the retiring of a write buffer
entry 152 will now be described. As in the [...], the sequence of Figure 15 is preferably performed under the
control of the [...] from write buffer control logic 150. The sequence of Figure 15 [...] 210 of Figure 10 described
hereinabove. This sequence begins with [...] control bit MAW of entry 152n is tested; if [...] the manner described
above. However, if [...] the data portion of entry 152n [...] misaligned nature of the write. However, [...] 152n is
not in the proper "byte lanes" [...]. Referring to Figure 4, shifter 164 is a conventional barrel shifter for shifting
the data [...] the corresponding write buffer section 152x, 152y prior to its storage in its [...], such that the lower
order data will appear in [...] illustrated in Figure 15.

Process 294 is next performed, by way of which the physical address of entry 152n is presented to cache
port 160 along with the portion of the data corresponding to the lower address eight-byte group, aligned (by
shifter 164 in process 292) to the byte lanes [...]. This effects the first write operation required for the
misaligned write. Process 296 then [...] the address and data for the second operand of the misaligned write.
[...] stored in the address portion of the next write buffer entry 152n+1, and the data is that [...] 162 from entry
152n, shifted by shifter 164 to the proper byte lanes for the second access to port 160. [...] then continues
(process 298).

As noted above, the exception handling ability of microprocessor 10 according to this embodiment of the
invention uses the state of the control bit DV to determine whether an entry is or is not flushed after
detection of an exception. However, in the case of a misaligned write, [...] entry 152n+1 does not
have its control bit DV set even if the write [...], as the valid data is contained within the preceding
(in program order) write buffer entry 152n. Accordingly, if both misaligned write handling capability and
exception handling as described herein are provided, the [...] control bit MAW and control bit DV for an entry
152n and, if both are set, must [...] the next write buffer entry 152n+1 [...].
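
The splitting of a misaligned write into two accesses to adjacent eight-byte groups can be sketched as follows. This is an assumption-laden illustration rather than the circuit described: the constant `GROUP`, the function names, and the test used to detect a boundary crossing (offset within the group plus write size exceeding eight) are hypothetical, and writes are assumed to be at most eight bytes long.

```python
GROUP = 8  # assumed size, in bytes, of one aligned group presented to the cache port


def is_misaligned(address, size):
    """True if a write of `size` bytes (1 to 8) at `address` crosses an eight-byte boundary."""
    return (address % GROUP) + size > GROUP


def split_misaligned(address, data):
    """Return the (address, bytes) accesses needed to retire the write.

    An aligned write yields a single access. A misaligned write yields two:
    the lower-address portion first, then the remainder at the start of the
    next eight-byte group (the address held by the following entry).
    """
    if not is_misaligned(address, len(data)):
        return [(address, data)]
    first_len = GROUP - (address % GROUP)
    next_group = (address // GROUP + 1) * GROUP
    return [(address, data[:first_len]), (next_group, data[first_len:])]
```

For example, a four-byte write at address 0x1006 would be split into two bytes written at 0x1006 and two bytes written at 0x1008.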


6.2 Gathered writes 

Another type of special write operation performable by microprocessor 10 according to this embodiment
of the invention is the gathered write, where the data contained within successive write operations may be
gathered into a single write [...] corresponds to a byte location. If a series of writes are to be [...] the same
block of bytes that may be placed on the data bus simultaneously, [...] to retain the data in the appropriate byte
[...] memory may be performed instead of successive [...] hereinbelow.

As described hereinabove relative to the allocation sequence for write buffer 29, mergeable control bit
MRG is set at the time of allocation for each write buffer entry 152 that is performing a write to a contiguous
non-overlapping physical memory address [...] write buffer entry 152 previously allocated for
the immediately preceding [...]. [...] adjacency constraints are implemented according to [...] preferred
embodiment of the invention in consideration of the X86-compatibility of microprocessor 10; it is [...] that write
gathering may be implemented in [...] same block of bytes is the only necessary [...].

[...] write buffer entry 152n being [...] contiguous non-overlapping writes. Decision 305 then interrogates the
next write buffer entry [...] to determine if its control bit MRG is set. If so, control returns to process 302 where
[...] gather latch 165 in process 304. Once no more mergeable entries exist, as indicated by either the control
bit MRG or the control bit AV being clear [...], [...] presented to port 160, along with the [...] the gathered write
operation to cache 60 or main memory 86, as the case may be [...].

According to this embodiment of the invention, therefore, the efficiency of retiring data to cache [...] in
lieu of multiple accesses to contiguous memory locations.
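
The gathering loop can be modeled as accumulating the data of successive mergeable entries into byte lanes before a single access is made. The sketch below is illustrative only: the dictionary keys, the function name `gather`, the eight-byte block size, and the use of `None` for unwritten byte lanes are all assumptions, and entries with their MRG bit set are assumed to fall within the same block as the entry they follow.

```python
BLOCK = 8  # assumed size, in bytes, of the block that can be placed on the data bus at once


def gather(entries, start):
    """Gather entry `start` and the mergeable entries that follow it.

    Returns (block_address, lanes, next_index), where `lanes` models the
    gather latch as a list of BLOCK byte values (None for unwritten lanes)
    and `next_index` is the first entry not included in the gathered write.
    """
    first = entries[start]
    base = first["address"] - first["address"] % BLOCK
    lanes = [None] * BLOCK

    def place(entry):
        # Copy the entry's data into the byte lanes matching its address.
        offset = entry["address"] - base
        for i, byte in enumerate(entry["data"]):
            lanes[offset + i] = byte

    place(first)
    index = start + 1
    # Continue while the next entry is allocated (AV set) and mergeable (MRG set).
    while index < len(entries) and entries[index]["av"] and entries[index]["mrg"]:
        place(entries[index])
        index += 1
    return base, lanes, index
```

A single access using the returned block address and lanes then takes the place of one access per gathered entry.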



7. Conclusion 



According to the preferred embodiment of the invention, [...] and the memory system (including [...]) [...]
sequence. This enables the cache and memory [...] on a high priority basis with minimum wait states due to
non-time-critical [...] the buses or memory systems.

In addition, the preferred embodiment [...] that are particularly beneficial for specific microprocessor
architectures [...] of two sections of the write buffer for superscalar processors, together with [...] program
order despite the splitting of [...]. [...] of the preferred embodiment of the invention [...] and exceptions, and
provision [...].

[...] contemplated that modifications [...] to those of ordinary skill in the art [...] alternatives are
encompassed within the scope of this invention.

Claims

1. A microprocessor comprising:

(a) core means for processing data according to operations defined by a sequence of instructions;
(b) a write buffer having a plurality of entries coupled to the core means;
(c) a cache memory having a plurality of memory locations [...];
(d) a bus coupled to the core means, the write buffer, and the cache memory;
(e) control logic coupled to the bus (d), to detect whether an instruction is a misaligned write instruction;
(f) a shifter coupled to the write buffer to shift the contents of a first entry of the plurality of entries detected
as a misaligned write instruction by the control logic, prior to presentation of the first entry to the cache
memory; and
(g) a misaligned write latch coupled to the shifter and to the cache memory to latch the shifted contents
of the entry of the plurality of entries and to present the data corresponding to the misaligned write [...].

2. [...] misaligned write control bit that is set by the control logic in response to detection [...].

3. The microprocessor of claim 2, wherein the control logic further sets the misaligned write control bit in
all of the plurality of entries responsive to detecting the misaligned write instruction.

4. The microprocessor of claim 3, wherein the control logic further loads the first entry of the plurality of
entries with a physical address in response to the write operation being detected as misaligned [...] and
further loads a second entry of the plurality of entries with a higher order physical address from that stored
in the first entry of the plurality of entries to serve as the address for the second write cycle.

5. In a pipelined microprocessor, a method of buffering results of operations executed by a central
processing unit core according to a series of instructions, such buffering being effected in a write buffer having a
plurality of entries and effected prior to storage in a cache memory, comprising the steps of:

(a) identifying whether an instruction is a misaligned write operation;
(b) determining a first physical memory address of a first portion to which results of the misaligned write
operation identified in step (a) are to be written;
(c) storing the first physical memory address in a first write buffer entry;
(d) determining a second physical memory address of a second portion to which results of the misaligned
write operation identified in step (a) are to be written;
(e) storing [...];
[...]
(g) latching the first and second portions of the operation results from the first write buffer entry into a [...];
(h) presenting the first physical address and the first latched portion of the operation results to the cache
[...].

6. The method of claim [...], wherein the first physical address corresponds to a lower order address than
[...].

7. The method of claim [...], further comprising, prior to step (g), step (j) shifting the first and second portions
of the instruction results so that the first portion of the instruction results resides in higher order byte positions
[...].

8. The method of claim 5, further comprising step (k), responsive to step (a), setting a misaligned write
control bit in the first write buffer entry to indicate that the write operation thereto will be a misaligned write.

9. The method of claim 8, wherein step (g) is performed responsive to the misaligned write control bit being
set in the first write buffer entry.

10. [...]

(b) secondary processing means for processing data according to operations defined by a second type
of program [...], the secondary processing means providing results having a data word greater in
bit width than that provided by the central processing means;

(c) a write buffer having a plurality of buffer entries coupled to the central processing means, each entry
including a data portion [...] results and an address portion to store physical memory
addresses at which results are stored;






EP 0 651 331 A1 



(d) a cache memory having a plurality of memory locations coupled to the write buffer to receive data
therefrom and coupled to the central processing means to present data thereto;

(e) a bus coupled to the central processing means, the secondary processing means, the write buffer, and 
the cache memory; 

(f) secondary data latch means for storing results of the secondary processing means that are written to 
memory at a physical address stored in a buffer entry; and 

(g) routing means for routing the data portion of the write buffer entry or contents of the secondary data 
latch means, to the cache memory. 

11. A method of buffering results of data processing operations executed by a central processing unit and
a secondary processing unit in a microprocessor, prior to storage of results in a cache memory of the
microprocessor through use of a write buffer having a plurality of write buffer entries, each having a data portion
and an address portion, where results of the secondary processing unit correspond to data words of greater
bit width than that of the data portion of the write buffer entries, comprising the steps of:

(a) determining a first memory address to store results of a first instruction; 

(b) storing the first memory address determined in step (a) in the address portion of a first write buffer 
entry; 

(c) executing a first instruction with either the central processing unit or the secondary processing unit; 

(d) responsive to the secondary processing unit executing the first Instruction in step (c), storing results 
in a secondary data latch having a wider bit width than the data portion of the plurality of write buffer entries; 

(e) responsive to the centraf processing unit executing the first instruction in step (c), storing the results 
of the first instruction in the data portion of the first write buffer entry; 

(f) retrieving results of the first instruction from the write buffer to store in the cache memory by selecting 
the contents of the secondary data latch if the first instruction was executed by the secondary processing 
unit, or selecting the contents of the data portion of the first write buffer entry if the first instruction was 
executed by the central processing unit; and

(g) presenting the contents selected in step (f) to the cache memory in combination with the first physical
address stored in the first write buffer entry.
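
By way of illustration only, the steps of claim 11 can be sketched in software. The following is a minimal model, assuming a 32-bit data portion per entry, a single 64-bit secondary data latch, and a per-entry flag recording which unit produced the result. The widths, the flag `use_latch`, and all class and method names are hypothetical and are not taken from the claims.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ENTRY_DATA_BITS = 32   # assumed width of the data portion of a write buffer entry
LATCH_DATA_BITS = 64   # assumed width of the wider secondary data latch


@dataclass
class WriteBufferEntry:
    address: Optional[int] = None   # address portion, filled in step (b)
    data: Optional[int] = None      # data portion, filled in step (e)
    use_latch: bool = False         # hypothetical flag: result is held in the latch


class LatchedWriteBuffer:
    def __init__(self, n_entries: int = 4) -> None:
        self.entries = [WriteBufferEntry() for _ in range(n_entries)]
        self.secondary_latch: Optional[int] = None  # single wide latch (assumption)

    def allocate(self, index: int, address: int) -> None:
        """Steps (a)-(b): record the target address before execution."""
        self.entries[index] = WriteBufferEntry(address=address)

    def store_result(self, index: int, value: int, secondary: bool) -> None:
        """Steps (d)-(e): place the result according to the executing unit."""
        limit = LATCH_DATA_BITS if secondary else ENTRY_DATA_BITS
        if not 0 <= value < (1 << limit):
            raise ValueError("result does not fit the selected storage")
        entry = self.entries[index]
        if secondary:
            self.secondary_latch = value
            entry.use_latch = True
        else:
            entry.data = value

    def retire(self, index: int) -> Tuple[int, int]:
        """Steps (f)-(g): select the data source and pair it with the address."""
        entry = self.entries[index]
        data = self.secondary_latch if entry.use_latch else entry.data
        return entry.address, data
```

In this sketch the address always comes from the write buffer entry, while the data comes either from the entry or from the wide latch, which mirrors the selection recited in step (f).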

12. A microprocessor comprising:

(a) central processing means for processing data according to operations defined by instructions to be
executed in a program order;

(b) a write buffer including a plurality of buffer entries arranged in first and second sections, coupled to
the central processing means;

(c) a cache memory having a plurality of memory locations coupled to the write buffer and the central
processing means;

(d) a bus coupled to the central processing means, the write buffer, and the cache memory; and

(e) control logic means for controlling the write buffer so that instruction results stored therein are
presented to the cache memory in program order.

13. In a microprocessor, a method of buffering results of data processing operations executed by a central
processing unit core according to a series of instructions in a program order, such buffering being prior to
storage in a cache memory of the microprocessor, comprising the steps of:

(a) for a plurality of instructions, determining a physical address to which instruction results are to be
written in memory;

(b) for each physical address determined in step (a), storing the determined physical address into one of
a plurality of write buffer entries, arranged into first and second sections;

(c) executing the instructions;

(d) storing results into the write buffer entries in which is stored the physical address for the instruction; and

(e) retrieving, in the program order, the results from the write buffer entries in step (d), for storage in the
cache memory at a location associated with the stored memory address.
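
For illustration, program-order retrieval from a two-section write buffer, as recited in claims 12 and 13, can be modeled as below. This is a sketch under stated assumptions: each section is modeled as a FIFO queue, and each entry records which entries of the opposite section were already allocated when it was created (a simplified stand-in for the cross-dependency tables mentioned in the abstract). The class name, the dictionary fields, and the `drain` loop are hypothetical.

```python
from collections import deque
from itertools import count


class TwoSectionWriteBuffer:
    def __init__(self) -> None:
        self.sections = (deque(), deque())  # one FIFO per section (assumption)
        self.cache = {}                     # address -> data, stands in for the cache
        self.retired = []                   # addresses in the order they were written
        self._ids = count()

    def allocate(self, section: int, address: int) -> dict:
        """Store the address and note older entries in the opposite section."""
        other = self.sections[1 - section]
        entry = {
            "id": next(self._ids),
            "addr": address,
            "data": None,
            "xdep": {e["id"] for e in other},  # older writes that must retire first
        }
        self.sections[section].append(entry)
        return entry

    def store(self, entry: dict, data: int) -> None:
        """Store the result once the instruction has executed."""
        entry["data"] = data

    def drain(self) -> None:
        """Retire every entry that has data and no outstanding cross-dependency."""
        progress = True
        while progress:
            progress = False
            for s in (0, 1):
                queue = self.sections[s]
                if queue and queue[0]["data"] is not None and not queue[0]["xdep"]:
                    entry = queue.popleft()
                    self.cache[entry["addr"]] = entry["data"]
                    self.retired.append(entry["addr"])
                    for younger in self.sections[1 - s]:
                        younger["xdep"].discard(entry["id"])
                    progress = True
```

If addresses are allocated in program order, an entry in one section cannot be written to the cache ahead of an older entry in the other section, even when its data arrives first.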

14. In a microprocessor, a method of buffering results of data processing operations executed by a central
processing unit core according to a series of instructions in a program order, the instructions including writes
to memory and non-cacheable reads from memory, such buffering being prior to storage in a cache memory
of the microprocessor, comprising the steps of:

(a) determining a physical address to which instruction results are to be written in memory;

(b) storing the physical address determined in step (a) into one of a plurality of write buffer entries and
setting an address valid control bit in the write buffer entry in which the physical address is stored;

(c) determining the physical address of the memory location from which the non-cacheable read is to be
accessed for each non-cacheable read instruction;

(d) loading a non-cacheable read dependency field having a plurality of bit positions, each bit position
corresponding to one of the write buffer entries, with the state of the address valid control bits for the
corresponding write buffer entry;

(e) executing the series of instructions;

(f) storing the results into the write buffer entry in which is stored the physical address for the instruction;

(g) retrieving, in program order, the stored results from the write buffer entries for storage in the cache memory
at a location associated with the stored memory address, and clearing the bit in the non-cacheable read
dependency field corresponding to the retrieved write buffer entry; and

(h) performing the non-cacheable read, responsive to the non-cacheable read dependency field being
clear.
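
The non-cacheable read dependency field of claim 14 can be illustrated with a small bit-mask model. This is a sketch only, assuming a four-entry write buffer and a dependency field held as an integer with one bit per entry. The class `NcReadOrderingBuffer` and its method names are hypothetical.

```python
class NcReadOrderingBuffer:
    def __init__(self, n_entries: int = 4) -> None:
        self.address = [None] * n_entries
        self.data = [None] * n_entries
        self.addr_valid = [False] * n_entries  # address valid control bits
        self.memory = {}                       # stands in for the cache memory

    def allocate(self, index: int, address: int) -> None:
        """Step (b): store the address and set the address valid bit."""
        self.address[index] = address
        self.addr_valid[index] = True

    def load_dependency_field(self) -> int:
        """Step (d): copy the address valid bits, one bit per entry."""
        field = 0
        for i, valid in enumerate(self.addr_valid):
            if valid:
                field |= 1 << i
        return field

    def store(self, index: int, data: int) -> None:
        """Step (f): store the result next to its address."""
        self.data[index] = data

    def retire(self, index: int, dependency_field: int) -> int:
        """Step (g): write the entry out and clear its dependency bit."""
        self.memory[self.address[index]] = self.data[index]
        self.addr_valid[index] = False
        return dependency_field & ~(1 << index)


def read_may_proceed(dependency_field: int) -> bool:
    """Step (h): the read is performed only when the field is clear."""
    return dependency_field == 0
```

A write allocated after the field has been loaded does not set a bit in it, so in this model only the writes that precede the read in program order hold the read back.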

15. A microprocessor comprising:

(a) central processing means for processing data according to operations defined by a sequence of
instructions;

(b) a write buffer having a plurality of buffer entries, coupled to the central processing means;

(c) a cache memory having a plurality of memory locations, coupled to the write buffer and to the central
processing means;

(d) a bus coupled to the central processing means, the write buffer, and the cache memory;

(e) control logic means for detecting that first and second instructions include memory writes to addresses
in the same byte group; and

(f) gathered write latch means for storing a data portion of the first and second instructions responsive to
the control logic means, and for presenting its contents to the memory cache in a single write cycle.

16. In a pipelined microprocessor, a method of buffering results of data processing operations executed
by a central processing unit core according to a series of instructions, such buffering being effected in a write
buffer having a plurality of write buffer entries and effected prior to storage in a cache memory of the
microprocessor, where data is communicated in a byte group from the write buffer to the cache memory in a write
cycle, comprising the steps of:

(a) detecting that first and second instructions include memory writes to addresses in the same byte group;

(b) determining first and second physical memory addresses to which results of the instructions are to be
written;

(c) storing the first and second physical addresses in first and second write buffer entries, respectively;

(d) executing the first and second instructions;

(e) storing results of the first and second instructions in the first and second write buffer entries,
respectively;

(f) [...]

(g) [...] addresses and the results latched in step (f) to the cache memory in a write cycle.
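
As an illustration of the gathered write recited in claims 15 and 16, the sketch below merges several writes that fall in one byte group into a single latch image that could be presented to the cache in one write cycle. The 8-byte group size, the byte-enable mask, and the function names are assumptions made for the example and are not specified by the claims.

```python
from typing import List, Tuple

BYTE_GROUP_SIZE = 8  # assumed number of bytes moved to the cache per write cycle


def same_byte_group(addr_a: int, addr_b: int) -> bool:
    """Detect that two write addresses fall in the same byte group."""
    return addr_a // BYTE_GROUP_SIZE == addr_b // BYTE_GROUP_SIZE


def gather(writes: List[Tuple[int, bytes]]) -> Tuple[int, bytes, int]:
    """Merge writes, given in program order, into one latch image.

    Returns the byte group base address, the latch contents, and a mask
    with one bit set for every byte that was actually written.
    """
    base = (writes[0][0] // BYTE_GROUP_SIZE) * BYTE_GROUP_SIZE
    latch = bytearray(BYTE_GROUP_SIZE)
    byte_enables = 0
    for address, data in writes:
        offset = address - base
        if offset < 0 or offset + len(data) > BYTE_GROUP_SIZE:
            raise ValueError("write falls outside the byte group")
        latch[offset:offset + len(data)] = data  # later writes overwrite earlier ones
        for i in range(len(data)):
            byte_enables |= 1 << (offset + i)
    return base, bytes(latch), byte_enables
```

Applying the writes in program order means that, where two writes overlap, the later one determines the final byte value, which matches what two separate write cycles would have produced.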


17. A microprocessor comprising:

(a) central processing pipeline means for [...] instructions so that a writeback stage and an address
calculation stage of a first and a second program instruction, respectively, are processed substantially
simultaneously;

(b) a write buffer having a plurality of buffer entries coupled to the central processing pipeline means;

(c) a cache memory having a plurality of memory locations coupled to the write buffer and to the central
processing pipeline means;

(d) a bus coupled to the central processing pipeline means, the write buffer, and the cache memory; and

(e) control logic means for comparing a physical address of a [...] instruction in the address calculation
stage to addresses associated with each of the plurality of buffer entries to detect a read-after-write data
dependency between the first and second instructions.
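
The address comparison of claim 17 can be illustrated as follows. This is a sketch only: it assumes each occupied buffer entry is described by a start address and a size in bytes, and it treats any byte overlap between the read and a buffered write as a read-after-write dependency. The function name and the overlap test are assumptions, since the claim recites only a comparison of addresses.

```python
from typing import List, Optional, Tuple


def find_raw_dependencies(
    read_addr: int,
    read_size: int,
    entries: List[Optional[Tuple[int, int]]],
) -> List[int]:
    """Return the indices of buffered writes that overlap the read.

    Each element of `entries` is None (entry unused) or a tuple of
    (write address, write size in bytes).
    """
    hits = []
    for index, entry in enumerate(entries):
        if entry is None:
            continue
        write_addr, write_size = entry
        # Two byte ranges overlap when each starts before the other ends.
        if read_addr < write_addr + write_size and write_addr < read_addr + read_size:
            hits.append(index)
    return hits
```

A non-empty result indicates that the read must wait for, or be supplied from, the matching write buffer entries rather than reading stale data from the cache.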

18. A method of buffering results of data processing operations executed by a central processing unit core
in a microprocessor prior to storage in a cache memory of the microprocessor, comprising the steps of:

[...]

(c) determining a second memory address from which data is to be read for a second instruction, the
second instruction being later in program order than the first instruction;

(d) comparing the second memory address to the first memory address stored
in the first entry of the plurality of write buffer entries to detect a match;

(e) executing a first instruction to produce a first result;

(f) storing the first result in the first write buffer entry; and

(g) retrieving the first result from the first write buffer entry for storage in the cache memory at a location
associated with the first memory address.

19. A microprocessor comprising:

(a) central processing [...] execution in a program order, [...] instructions is of a conditional branch type;

(b) a write buffer including a plurality of buffer entries, coupled to the central processing unit core to receive
data therefrom, [...] plurality of buffer entries including a speculation control bit [...] that data to be
written to its buffer entry is from [...];

(c) a cache memory having a plurality of memory locations coupled to the write buffer and to the central
[...];

(d) a bus coupled to the central processing means, the write buffer, and the cache memory; and

(e) [...] presentation of data by the write buffer to the cache memory so that [...] memory only if the
speculation control bit is not set.
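
The role of the speculation control bit in claim 19 can be illustrated with the sketch below. It assumes a single outstanding conditional branch, an ordered list of entries with the oldest first, and a flush of speculative entries on a misprediction. The class name, the flush behavior, and the method names are hypothetical and go beyond what the claim recites.

```python
class SpeculativeWriteBuffer:
    def __init__(self) -> None:
        self.entries = []  # oldest first; each entry is a dict
        self.cache = {}    # address -> data, stands in for the cache memory

    def write(self, address: int, data: int, speculative: bool) -> None:
        """Buffer a write, setting the speculation bit if it follows a predicted branch."""
        self.entries.append({"addr": address, "data": data, "spec": speculative})

    def resolve_branch(self, prediction_correct: bool) -> None:
        """Clear the speculation bits on a correct prediction, else drop the entries."""
        if prediction_correct:
            for entry in self.entries:
                entry["spec"] = False
        else:
            self.entries = [e for e in self.entries if not e["spec"]]

    def drain(self) -> None:
        """Present entries to the cache in order, stopping at a set speculation bit."""
        while self.entries and not self.entries[0]["spec"]:
            entry = self.entries.pop(0)
            self.cache[entry["addr"]] = entry["data"]
```

Because draining stops at the first entry whose speculation bit is set, speculative data never reaches the cache before the branch is resolved, and program order among the remaining entries is preserved.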

20. [...] prior to storage in a cache memory of the microprocessor, comprising the steps of:

[...]

(c) [...] corresponding to a write to memory, determining a first physical memory
address to which results are to be written in memory;

(d) storing the first physical address in a first write buffer entry;

(e) executing the write to memory instruction predicted in step (b);

(f) storing the results of step (e) in the first write buffer entry;

(g) determining the condition upon which the conditional branch instruction depends;

(h) responsive to step (g), indicating that step (b) was correct; and

(i) retrieving the results of the [...]

21. A method of handling exceptions [...] having a central processing [...] of instructions, the microprocessor
including [...] central processing unit core prior to storage in a cache memory, comprising the steps of:

(a) determining a first memory address in which to store results of a first instruction;

(b) storing the first memory address in a first write buffer entry;

(c) detecting an exception condition prior to execution of the first instruction; and

(d) responsive to step (c), invalidating the first write buffer entry.
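
The exception handling of claim 21 can be illustrated with a minimal model in which a write buffer entry that already holds an address is invalidated when an exception is detected before the instruction executes. The four-entry size, the per-entry valid flag, and the names used are assumptions for the example.

```python
class ExceptionAwareWriteBuffer:
    def __init__(self, n_entries: int = 4) -> None:
        self.address = [None] * n_entries
        self.valid = [False] * n_entries  # hypothetical per-entry valid flag

    def allocate(self, index: int, address: int) -> None:
        """Steps (a)-(b): store the address ahead of execution."""
        self.address[index] = address
        self.valid[index] = True

    def on_exception(self, index: int) -> None:
        """Steps (c)-(d): invalidate the entry of the instruction that will not execute."""
        self.valid[index] = False
        self.address[index] = None

    def pending(self) -> list:
        """Indices of entries that still await a result."""
        return [i for i, is_valid in enumerate(self.valid) if is_valid]
```

Invalidating the entry releases it for reuse and ensures that no write is presented to the cache for an instruction that never produced a result.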
FIG. 1b

FIG. 12b